A task-unified network with transformer and spatial–temporal convolution for left ventricular quantification

Quantification of cardiac function is vital for diagnosing and treating cardiovascular diseases. Left ventricular function measurement is the most commonly used way to evaluate cardiac function in clinical practice, and improving the accuracy of left ventricular quantitative assessment has long been a subject of medical research. Although considerable effort has been devoted to measuring the left ventricle (LV) automatically with deep learning methods, accurate quantification remains challenging because the anatomical structure of the heart changes throughout the systolic-diastolic cycle. Moreover, most methods use direct regression, which lacks visual analysis. In this work, a task-unified deep learning network for segmentation and regression, built on a transformer and spatial–temporal convolution, is proposed to segment and quantify the LV simultaneously. The segmentation module leverages a U-Net-like 3D Transformer model to predict the contours of three anatomical structures, while the regression module learns spatial–temporal representations from the original images and the reconstructed feature maps from the segmentation path to estimate the desired quantification metrics. Furthermore, we employ a joint task loss function to train the two modules. Our framework is evaluated on the MICCAI 2017 Left Ventricle Full Quantification Challenge dataset. The experimental results demonstrate the effectiveness of our framework, which achieves competitive cardiac quantification results while also producing visualized segmentation results that are conducive to later analysis.


Dapeng Li 1, Yanjun Peng 1,2*, Jindong Sun 1 & Yanfei Guo 1
Cardiovascular diseases (CVDs) are the leading cause of death globally. According to the World Health Organization (WHO), about 17.9 million people died from CVDs in 2016, mainly from heart disease and stroke 1. CVD is a general term for a series of diseases of the heart and blood vessels, such as coronary heart disease, stroke, heart failure, rheumatic heart disease, congenital heart defects, and arteriovascular disease. In recent years, with the rapid development of society and the economy, people's lifestyles have undergone profound changes. Owing to unhealthy living habits, an aging population, and the continuing prevalence of metabolic syndrome, the incidence of cardiovascular diseases keeps rising. Cardiovascular diseases also increasingly strike suddenly and at younger ages, which requires timely detection and treatment. The heart is the most important organ of the human body; its main function is to power blood flow, transport blood to all parts of the body, and maintain the normal metabolism and function of cells. Abnormalities in the shape, volume and functional parameters of the heart are signs of various CVDs. For example, an abnormal heart shape is a symptom of hypertrophic heart disease, abnormal volume is characteristic of dilated cardiomyopathy, enlargement of the left atrium and right ventricle is a sign of rheumatic heart disease, and a gradual decrease in left ventricular ejection fraction is an important feature of coronary heart disease. Therefore, monitoring the shape, volume and function of the heart with medical instruments has become the most important way to diagnose and treat cardiovascular diseases. In clinical practice, imaging equipment is used to acquire images of a patient's heart. Radiologists annotate the anatomical structures of the heart and quantify the cardiac metrics to support the next steps of diagnosis and treatment.
To support the diagnosis and treatment of CVDs, many medical imaging technologies, including computed tomography (CT) and magnetic resonance imaging (MRI), are used. Cardiac MRI offers good soft-tissue contrast resolution and a large scanning field of view, and it can acquire oblique cross-sectional images in various directions and at different angles. It has become the gold standard for non-invasive and non-radiative evaluation of cardiac structure and function 2. Left ventricle (LV) quantification indices such as the end-diastolic internal diameter, the end-systolic internal diameter and the ejection fraction (EF) are the most important indicators for evaluating cardiac function in clinical practice. Therefore, accurate quantification of clinical cardiac function is of great importance for the early diagnosis and identification of CVDs.
In clinical practice, LV function information relies on laborious manual delineation of the LV epicardium and endocardium by radiologists. Meanwhile, manual assessment of LV function is complicated by the changing anatomical structure over the systolic-diastolic cycle and by laborious calculations that are hard to trace 3. Although many efforts have been devoted to automatic or semi-automatic methods that address these problems, the following challenges must still be addressed for robust and accurate LV quantification: (1) the variability of the ventricle's shape and appearance over the frame sequence of a whole cardiac cycle, due to different pathologies; (2) the low contrast of anatomical structures and the inhomogeneous brightness and texture in MRI 4,5.
In early clinical practice, doctors drew the contours of the LV cavity and LV myocardium manually and used the segmented contours to obtain reliable quantification. However, because of the large number of cardiac images, this process is time-consuming and tedious. Therefore, exploring automated methods that reduce the workload of radiologists and increase the precision of quantification is of great importance. Two categories of methods exist in the left ventricular quantification domain: indirect segmentation-based methods and direct-regression methods (as depicted in Fig. 1). Although these models have shown strong performance in cardiac LV quantification, both categories have advantages and disadvantages. Integrating the segmentation module and the regression module into a unified platform helps the framework exploit more robust feature representations and achieve precise quantification results. A considerable number of methods have been introduced in the cardiac quantification field. Xue et al. 2 proposed a Bayesian neural network that incorporates Monte-Carlo dropout for deep feature extraction, and then designed an uncertainty-weighted loss function to train the network. Du et al. 6 used a two-step network consisting of a segmentation network that obtains the contour of the target and a regression network that quantifies the LV indices based on the preceding segmentation results. Vesal et al. 7 first segmented the cardiac LV contour with an encoder-decoder network, and then introduced a multi-task framework consisting of a regression task and a classification task to obtain the final results. Ge et al. 8 proposed a K-shaped Unified Network to directly segment and quantify the LV simultaneously. Chen et al. 9 combined a dynamic analysis module, a segmentation module and a quantification encoder module into a multi-task conditional learning model.
Although these elaborately designed approaches improve generalization performance, some disadvantages should not be neglected. In multi-module networks, the feature information from the segmentation path is not sufficiently exploited, and a complex multi-module network is prone to degraded quantification performance when segmentation performance degrades. In this paper, a new end-to-end, fully automatic, task-unified deep learning framework for LV segmentation and quantification is proposed. The task-unified model, which consists of a segmentation path and a regression path, helps to represent the original image, learn multi-scale features and capture the spatial-temporal information of the cardiac anatomical structure. With this method, LV function can be obtained from the final regression network and provided to clinicians for quantitative diagnosis. The contributions of this work are summarized as follows: (1) A robust and effective task-unified framework improves the performance of complete LV indices quantification, which includes two areas, three cavity sizes and six regional wall thicknesses. (2) The segmentation network is leveraged to obtain visual segmentation results and to provide reconstructed, low-noise feature maps for the regression network. (3) A combined multi-task loss is used to supervise the unified framework.
We conduct five-fold cross-validation experiments on the public MICCAI-2018 Left Ventricle Full Quantification Challenge (LV-Quan) dataset. The results of the cross-validation experiments demonstrate competitive performance. The remainder of this paper is organized as follows. The "Related works" section reviews related work in the cardiac ventricle quantification field. The "Methods and materials" section presents the architecture of our proposed multi-task, unified segmentation and regression framework. The segmentation and quantification experimental results are detailed in the "Experiments and results" section. Finally, the conclusion is presented in the "Conclusions" section, and the acknowledgement is presented in the "Acknowledgements" section.

Related works
LV quantification methods. Indirect-segmentation methods segment the LV myocardium first and then quantify the cardiac indices. Direct-regression methods learn the mapping between cardiac MR images and cardiac indices directly. Owing to the powerful representation ability of neural networks, both kinds of methods have improved the quantification of cardiac LV indices.
The indirect segmentation-based method is a two-step approach in which the desired cardiac LV indices are measured in the second step from the segmentation results of the first step. Most of the early LV quantification works 10,11 fall into this category. Classic image processing methods such as active contours 12,13, level sets 14, deformable models and prior knowledge have developed considerably over the past decades 15,16. Recently, convolutional neural networks (CNNs) have shown impressive performance in segmenting the cardiac LV when combined with level-set and deformable models 17-20. Other deep neural network architectures introduced in the cardiac segmentation field include the parallel coarse-to-fine network 21, grid-like CNN 22, encoder-decoder architecture 23, dilated CNN 24, deeply supervised 3D-CNN 25, generative adversarial learning 26, and shape prior knowledge 27. Zhen et al. 28 first used a multi-scale deep neural network to learn hierarchical information and then fed it into a random forest to regress the cardiac LV indices. Furthermore, they proposed supervised descriptor learning to calculate four chamber volumes 29. Wang et al. 30 leveraged an adaptive Bayesian method combined with shape features to estimate ventricular cavity volumes. Indirect-segmentation methods offer not only cardiac indices quantification results but also a visualization of the cardiac LV myocardium. However, methods in this category are cascaded: they have only a forward connection and no feedback from the second step. As a result, unrepresentative extracted features lead to inaccurate quantification results.
The direct-regression method for cardiac LV quantification has undergone considerable development and gained recognition 31-35. When annotated ground truth for the images is not provided, direct regression is the preferable method. This approach enables many effective analysis tools for cardiac MRI 28. Because the direct architecture helps to capture more expressive LV information, combinations of feature representation and regression models have been introduced. Luo et al. 36 estimated the cardiac volume by leveraging a multi-view fusion strategy over the cardiac systole and end-diastole cycle. Kabani et al. 37 used a CNN to crop the ROI and estimate the volume over the cardiac systole and end-diastole cycle. Xue et al. 38 introduced the first end-to-end cardiac indices quantification framework. Additionally, in 39 they used a multitask neural network that models the relations among cardiac LV indices and between tasks through Bayesian-based relationship learning. Although these methods have demonstrated their effectiveness, it remains difficult for direct-regression methods to learn representative features because cardiac anatomical structures are highly variable.

Cardiac quantification indices.
The quantitative indices of the cardiac LV mainly include the six regional wall thicknesses of the LV myocardium and the LV cavity dimensions, which describe anatomical structure, and the LV cavity and myocardium areas, which are used to calculate cardiac function parameters such as ejection fraction (EF), as demonstrated in Fig. 2. These cardiac metrics are strongly correlated with regional and global cardiac function assessments. In 10, the clinical roles of more cardiac indices are fully explained. Many existing methods focus on estimating the LV volume, which is simplified to the integral of the cavity area or is hard to quantify as a result of the high contrast. When multiple types of cardiac quantification indices are estimated, more challenges arise. On the one hand, the cardiac quantification indices differ from each other in how they relate to the 2D spatial image structure, so a more robust and relevant representation is needed for estimation. On the other hand, regional wall thickness and myocardial area suffer from the complex dynamic deformation of the myocardium, as well as from the invisible epicardial border of the ventricle. The regional wall thickness is also affected by the orientation of the myocardium. Thus the segmentation and regression paths should be able to cope with dynamic deformation, imperceptible boundaries and direction changes 38. The LV-Quan challenge was held in conjunction with the Statistical Atlases and Computational Modeling of the Heart (STACOM) workshop at MICCAI 11, which created a foundational dataset for research on cardiac LV quantification.
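As a concrete illustration of how area-type indices lead to a functional parameter, the following minimal sketch approximates the LV volume as the integral of cavity areas over slices and derives the EF from it. The function names, the slice thickness and the sample areas are hypothetical and are not taken from the dataset.

```python
def volume_from_areas(cavity_areas_mm2, slice_thickness_mm):
    """Approximate LV volume (mm^3) as the integral of cavity area over slices."""
    return sum(cavity_areas_mm2) * slice_thickness_mm


def ejection_fraction(edv, esv):
    """EF = (EDV - ESV) / EDV, returned as a fraction between 0 and 1."""
    if edv <= 0:
        raise ValueError("end-diastolic volume must be positive")
    return (edv - esv) / edv


# Hypothetical cavity areas (mm^2) of three slices at end-diastole and end-systole.
ed_areas = [1800.0, 2000.0, 1200.0]
es_areas = [900.0, 1000.0, 600.0]

edv = volume_from_areas(ed_areas, slice_thickness_mm=8.0)
esv = volume_from_areas(es_areas, slice_thickness_mm=8.0)
ef = ejection_fraction(edv, esv)
```

Because EF is a ratio of two volumes, any error in the estimated cavity areas propagates directly into it, which is one reason accurate area estimation matters.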

Methods and materials
An overview of the architecture of our task-unified framework is presented in Fig. 3. The cardiac LV indices quantification adopts the idea of direct-regression methods. However, because the mapping between the input cardiac MRI and the ground-truth indices is fuzzy, we introduce a task-unified framework that propagates structural feature information from the segmentation path to the regression path at multiple scales. The framework takes sequences of 5 slices as a 3-dimensional input; the segmentation path outputs predictions for all 5 slices, while the regression path predicts the ground-truth indices for the middle slice. The framework is beneficial in three respects: (1) We incorporate temporal dynamic feature information from the neighboring slices, which eases the segmentation prediction task. (2) Multi-scale structural image information from the segmentation path enhances the ability of the regression path to quantify the cardiac LV. (3) The unified framework reduces over-fitting and provides not only segmentation results but also quantification results. In the following, the main components of our framework are described in detail.
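The 5-slice input with a middle-slice label can be sketched as an index-building step. This is a minimal illustration only: the helper name `make_sequences` is hypothetical, and treating the cardiac cycle as periodic (so that windows wrap around at the first and last frames) is an assumption, since the text does not state how border frames are handled.

```python
def make_sequences(num_frames=20, k=5):
    """Return (window, middle) index pairs for every frame of one cardiac cycle.

    Assumption: the cycle is periodic, so windows wrap around at the borders.
    """
    if k % 2 == 0:
        raise ValueError("k must be odd so that a middle frame exists")
    half = k // 2
    sequences = []
    for mid in range(num_frames):
        # Indices of the k frames centred on `mid`, wrapped modulo the cycle length.
        window = [(mid + offset) % num_frames for offset in range(-half, half + 1)]
        sequences.append((window, mid))
    return sequences


sequences = make_sequences()
first_window, first_mid = sequences[0]
```

Under this assumption every frame of a cycle serves once as the middle slice, so the number of regression targets equals the number of frames.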
Segmentation path. To segment the LV cavity and LV myocardium from cardiac MRI, we employ a 3D U-Net-like Transformer model.
Regression path. To regress the LV indices, we introduce a task-unified spatial-temporal convolution architecture, which is trained in an indirect and a direct manner simultaneously. This regression path consists of 3D spatio-temporal convolution blocks, Recurrent Residual Attention Convolutional (RRAconv) blocks and a fully connected (FC) layer. Many previous works have used 3D spatio-temporal convolution blocks to incorporate spatial information and temporal dynamic information 42-44. We apply 3D RRAconv to 2D + time image frames to learn temporal dynamic information. Each RRAconv block contains two Recurrent Residual convolutions and a squeeze-and-excitation (SE) channel attention module. In our understanding, noise in the original cardiac MRI affects the accuracy of regression. Hence, we add skip connections between the multi-scale structural image information in the segmentation path and the LV indices information in the regression path to suppress the original noise and improve the accuracy of quantification. SE is used to adaptively combine information from the current regression path with the corresponding information from the segmentation path.
The input tensor of the regression path has a size of k × h × w, where k is the number of slices, which forms the temporal dimension, and h × w denotes the spatial dimensions. Each 3D RRAconv block uses Recurrent Residual convolutions with a kernel size of 3. ReLU activation and 3D batch normalization are used in this block. The spatial-temporal block is composed of two cascaded 3D convolution layers followed by a 3D max-pooling layer. Of the two cascaded 3D convolution layers, the first uses a 3 × 1 × 1 kernel to capture temporal information and the second uses a 1 × 3 × 3 kernel with a stride of 1 to learn spatial information. The following max-pooling layer uses a 1 × 2 × 2 kernel to reduce the size of the feature maps, so that the LV indices are regressed only for the central slice. ReLU activation and 3D batch normalization are also used in this block. We initialize the convolution kernels with the He initializer and apply weight regularization to reduce over-fitting 45.
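To make the factorized spatial-temporal block concrete, the sketch below tracks the tensor shape through a 3 × 1 × 1 convolution, a 1 × 3 × 3 convolution and 1 × 2 × 2 max-pooling, and compares the weight count with a full 3 × 3 × 3 convolution. The padding values, the pooling stride, the 80 × 80 input size and the 16 channels are assumptions made for illustration; the text does not specify them.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution or pooling operation along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def block_output_shape(shape, temporal_padding=1, spatial_padding=1):
    """Shape (k, h, w) after a 3x1x1 conv, a 1x3x3 conv and 1x2x2 max-pooling.

    Assumption: size-preserving padding by default and a pooling stride equal
    to the pooling kernel; neither is stated in the text.
    """
    k, h, w = shape
    # Temporal convolution: kernel 3 along the slice axis only.
    k = conv_out(k, 3, padding=temporal_padding)
    # Spatial convolution: 3x3 kernel in the image plane with stride 1.
    h = conv_out(h, 3, padding=spatial_padding)
    w = conv_out(w, 3, padding=spatial_padding)
    # Max-pooling with a 1x2x2 kernel halves the in-plane size only.
    h = conv_out(h, 2, stride=2)
    w = conv_out(w, 2, stride=2)
    return (k, h, w)


def conv3d_weights(c_in, c_out, kernel):
    """Number of weights (bias excluded) of a 3D convolution layer."""
    kt, kh, kw = kernel
    return c_in * c_out * kt * kh * kw


padded = block_output_shape((5, 80, 80))
unpadded_time = block_output_shape((5, 80, 80), temporal_padding=0)
factorized = conv3d_weights(16, 16, (3, 1, 1)) + conv3d_weights(16, 16, (1, 3, 3))
full = conv3d_weights(16, 16, (3, 3, 3))
```

With these illustrative numbers the factorized pair needs fewer than half the weights of a full 3D kernel. If the temporal convolution is left unpadded, each block shortens the slice axis by two, so two such blocks would reduce 5 slices to 1; this is one possible reading of how the prediction is restricted to the central slice.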
Since the preceding segmentation path, RRAconv blocks and spatio-temporal blocks have already extracted rich representation information from the cardiac MRI, there is no need to design a more complex or deeper neural network for the final multi-task of regression and classification. Two parallel branches are therefore derived to complete the final multi-task. One branch is a shallow CNN used as a regressor to quantify wall thicknesses, dimensions and areas; the other is a fully connected branch composed of a multi-layer perceptron with 360 neurons and an output layer with 2 neurons, which classifies the cardiac phase as systole or diastole.

Loss function.
Because the framework has two task paths, multiple tasks need to be addressed and the loss function must be carefully designed to supervise the unified network. Therefore, we use a joint-task loss function covering LV segmentation, indices regression and phase classification.
The segmentation path segments a cardiac MRI into three labels: LV myocardium, LV cavity and background. The objective function is designed for precise segmentation and to help the network handle the severe class imbalance. We employ a loss function that combines the Dice loss and the Cross-Entropy (CE) loss. The Dice loss can improve the segmentation metrics, and the CE loss can increase the accuracy. Many works have combined these two loss functions to supervise neural networks and achieved impressive performance 46. Motivated by this, we also combine the two loss functions to construct a new loss. Since the Dice loss puts more emphasis on the overall similarity coefficient, we empirically set the weights of the two loss terms to 1 and 1.5. The overall loss function is given in Eq. (1).
In the regression path, we minimize a combination of mean squared error (MSE) and binary cross-entropy (BCE) losses over sets of k slices, where ground-truth annotations are only provided for the middle slice. Given a set of k slices $x_i = (x_0, \ldots, x_{k-1})$, the label for the middle slice $y_i = (y_{dim}, y_{area}, y_{rwt}, y_{phase})$ and the predictions of our model $\hat{y} = (\hat{y}_{dim}, \hat{y}_{area}, \hat{y}_{rwt}, \hat{y}_{phase})$, the combined loss function is defined as Eq. (2). Equation (3) is used to train the entire unified framework end to end; it consists of the segmentation path loss $L_{Segpath}$ and the regression path loss $L_{Regpath}$. We empirically set the two weights in Eq. (3) to 4 and 1 to balance the importance and gradients of the different task paths. Since the regression path relies on the segmentation results, we give more weight to the segmentation path task than to the regression path task. This approach prompts the unified framework to output precise LV predictions.
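The structure of the joint-task loss can be sketched as follows. This is a simplified illustration under stated assumptions, not the exact formulation of Eqs. (1)-(3): the segmentation term is shown for a single foreground class instead of three labels, the assignment of weight 1 to the Dice term and 1.5 to the CE term is assumed from the order in which the two losses are named, and the MSE and BCE terms of the regression path are summed without extra weights.

```python
import math


def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss for one class; pred and target are flat lists in [0, 1]."""
    intersection = sum(p * t for p, t in zip(pred, target))
    return 1.0 - (2.0 * intersection + eps) / (sum(pred) + sum(target) + eps)


def binary_cross_entropy(pred, target, eps=1e-7):
    """Mean binary cross-entropy between probabilities and 0/1 labels."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(pred)


def mse(pred, target):
    """Mean squared error between two equally long lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def segmentation_loss(pred, target, w_dice=1.0, w_ce=1.5):
    # Assumption: weight 1 belongs to the Dice term and 1.5 to the CE term.
    return w_dice * dice_loss(pred, target) + w_ce * binary_cross_entropy(pred, target)


def regression_loss(pred_indices, true_indices, pred_phase, true_phase):
    # Assumption: MSE on the indices plus BCE on the phase, equally weighted.
    return mse(pred_indices, true_indices) + binary_cross_entropy([pred_phase], [true_phase])


def joint_loss(seg, reg, w_seg=4.0, w_reg=1.0):
    """Weighted sum of the two path losses (segmentation weighted 4, regression 1)."""
    return w_seg * seg + w_reg * reg


# Toy values: four pixel probabilities, two normalized indices, one phase probability.
seg = segmentation_loss([0.9, 0.1, 0.8, 0.2], [1, 0, 1, 0])
reg = regression_loss([0.30, 0.25], [0.28, 0.27], 0.7, 1)
total = joint_loss(seg, reg)
```

Weighting the segmentation term more heavily means that gradients from the segmentation labels dominate the shared features, which matches the stated rationale that the regression path relies on the segmentation results.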

Experiments and results
We implemented our framework with PyTorch, and the experiments were carried out on one NVIDIA RTX 2080 Ti GPU. The experimental results are presented in the following sections.

Data and preprocess.
The data used in this study include 2900 cardiac MR images of 145 patients 38. Each subject has 20 mid-cavity frames covering one cardiac systolic-diastolic cycle. The images come from three affiliated hospitals of two medical centers (London Medical Center and St. Joseph's Medical Center). The age of the subjects ranges from 16 to 97 years, with an average of 58.9 years. The pixel spacing of the MR images ranges from 0.6836 mm/pixel to 2.0833 mm/pixel, with a mode of 1.5625 mm/pixel. The pathological types of the subjects are diverse, including regional wall motion abnormalities, myocardial hypertrophy, mildly enlarged LV, atrial septal defect, LV dysfunction, etc. In each frame, the LV is divided into three equal parts: basal, mid-cavity and apical 47. Before the experiments, several pre-processing steps were applied by the challenge organizers: (1) landmark labelling, (2) rotation, (3) ROI cropping and (4) resizing. After this procedure, the images from different subjects are approximately aligned in size, orientation and scale. This makes the assessment independent of the pre-processing and allows researchers to focus on LV quantification.
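The dataset size is consistent with 145 subjects × 20 frames = 2900 images. For a five-fold cross-validation on such data, a split at the subject level keeps all frames of one subject in the same fold. The sketch below assumes this subject-level split and a fixed random seed; the text does not describe how the folds were actually formed, and the function name is hypothetical.

```python
import random


def five_fold_subject_split(num_subjects=145, num_folds=5, seed=0):
    """Split subject indices into folds so that no subject spans two folds."""
    subjects = list(range(num_subjects))
    random.Random(seed).shuffle(subjects)
    # Every num_folds-th subject of the shuffled list goes to the same fold.
    return [sorted(subjects[i::num_folds]) for i in range(num_folds)]


folds = five_fold_subject_split()
frames_per_subject = 20

# Use fold 0 as the held-out fold and the remaining folds for training.
test_subjects = folds[0]
train_subjects = [s for i, fold in enumerate(folds) if i != 0 for s in fold]
num_test_frames = len(test_subjects) * frames_per_subject
num_train_frames = len(train_subjects) * frames_per_subject
```

Splitting by subject rather than by frame avoids placing highly similar neighboring frames of one cardiac cycle in both the training and the test set.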
In the ground truth, the epicardial and endocardial borders of the LV myocardium were manually labeled by radiologists. Based on these borders, we re-divide the ground truth into three label categories: LV cavity, LV myocardium and background. The LV indices and cardiac phase are strongly correlated with cardiac function metrics such as ejection fraction. The LV index values are normalized by the dimension of the image or by the number of pixels.
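Because the indices are stored in normalized form, they have to be rescaled before errors can be reported in millimetres. The sketch below assumes that lengths (cavity dimensions and wall thicknesses) are normalized by the image side length in pixels and areas by the total pixel count; the function name and the 80-pixel image size in the example are hypothetical.

```python
def denormalize_indices(norm_dims, norm_areas, norm_rwts, image_size, pixel_spacing_mm):
    """Convert normalized LV indices back to physical units.

    Assumption: lengths are normalized by the image side length (pixels) and
    areas by the total number of pixels of a square image.
    """
    side_mm = image_size * pixel_spacing_mm          # physical side length of the image
    dims_mm = [d * side_mm for d in norm_dims]       # cavity dimensions in mm
    rwts_mm = [t * side_mm for t in norm_rwts]       # wall thicknesses in mm
    areas_mm2 = [a * side_mm ** 2 for a in norm_areas]  # areas in mm^2
    return dims_mm, areas_mm2, rwts_mm


# Example with the most frequent pixel spacing of the dataset (1.5625 mm/pixel).
dims_mm, areas_mm2, rwts_mm = denormalize_indices(
    norm_dims=[0.4], norm_areas=[0.1], norm_rwts=[0.08],
    image_size=80, pixel_spacing_mm=1.5625,
)
```

Since the pixel spacing differs between subjects, the same normalized error corresponds to different physical errors, so the rescaling has to use each subject's own spacing.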
We conduct five-fold cross-validation experiments on the LV-Quan dataset. We first apply z-score normalization based on the mean and standard deviation, and then employ data augmentation techniques including random rotations between −90° and 90°, random horizontal and vertical flips with a probability of 50 percent, elastic deformations, and gamma shifts in the range of 0.5 to 1.5. Contrast Limited Adaptive Histogram Equalization (CLAHE) is applied to each training image slice to reduce intensity inhomogeneity (as shown in Fig. 4). The segmentation model was trained with the RAdam optimizer for 500 epochs with β1 = 0.9 and β2 = 0.999, a weight decay of 1E−4, and an initial learning rate of 5E−4 that was decayed exponentially with a parameter of 0.99. The transformer module is pre-trained on ImageNet 48. The regression model and the classification model are both trained with the SGD optimizer, with a learning rate of 5E−4, a weight decay of 5E−3 and a momentum of 0.06.
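Two of the steps above, z-score normalization and the exponentially decayed learning rate, can be written compactly. The sketch assumes that the decay factor 0.99 is applied once per epoch and that normalization is computed per image; both are assumptions, and the function names are hypothetical.

```python
import math


def z_score(pixels):
    """Normalize a flat list of intensities to zero mean and unit variance."""
    mean = sum(pixels) / len(pixels)
    variance = sum((p - mean) ** 2 for p in pixels) / len(pixels)
    std = math.sqrt(variance)
    if std == 0:
        # A constant image carries no contrast; map it to all zeros.
        return [0.0 for _ in pixels]
    return [(p - mean) / std for p in pixels]


def exponential_lr(epoch, initial_lr=5e-4, gamma=0.99):
    """Learning rate after `epoch` decay steps (assumed: one step per epoch)."""
    return initial_lr * gamma ** epoch


normalized = z_score([2.0, 4.0, 6.0])
lr_start = exponential_lr(0)
lr_end = exponential_lr(500)
```

Under the per-epoch assumption the learning rate shrinks by roughly two orders of magnitude over the 500 training epochs.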

Results.
The performance of our task-unified model is evaluated in terms of the prediction accuracy of LV segmentation and LV indices quantification. The Dice coefficient and the Hausdorff Distance are used to evaluate the performance of the LV segmentation. The Dice coefficient is defined in Eq. (4), where A and B are two sets. Evidently, Dice(A,B) is maximized at 1 when A = B and minimized at 0 when A ∩ B = ∅. The Hausdorff Distance is defined in Eq. (5), where h(A,B) represents the directed distance from set A to set B. In addition, we use the mean absolute error (MAE), the Pearson correlation coefficient (PCC) and the Error Rate to evaluate the performance of the regression path. They are defined in Eqs. (6), (7) and (8), where $y_{indice}$ denotes the ground-truth indices and $\hat{y}_{indice}$ denotes the indices predicted by our proposed unified framework. Here, $\bar{y}_{indice}$ and $\bar{\hat{y}}_{indice}$ are the corresponding mean values, and $y_{phase}$ and $\hat{y}_{phase}$ are the annotated label and the predicted class for the cardiac systolic or diastolic phase.
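For reference, the evaluation metrics named above can be implemented from their standard definitions. The sketch represents segmentations as sets of pixel coordinates and assumes non-empty sets for the Hausdorff distance and non-constant inputs for the PCC; the function names are illustrative.

```python
import math


def dice(a, b):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) of two sets of pixel coordinates."""
    if not a and not b:
        return 1.0
    return 2.0 * len(a & b) / (len(a) + len(b))


def directed_hausdorff(a, b):
    """h(A, B): largest distance from a point of A to its nearest point in B."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance max(h(A, B), h(B, A))."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


def mae(pred, true):
    """Mean absolute error between predicted and ground-truth indices."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


def pcc(pred, true):
    """Pearson correlation coefficient between predictions and ground truth."""
    mean_p = sum(pred) / len(pred)
    mean_t = sum(true) / len(true)
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, true))
    norm_p = math.sqrt(sum((p - mean_p) ** 2 for p in pred))
    norm_t = math.sqrt(sum((t - mean_t) ** 2 for t in true))
    return cov / (norm_p * norm_t)


def error_rate(pred_phase, true_phase):
    """Fraction of frames whose predicted cardiac phase differs from the label."""
    wrong = sum(p != t for p, t in zip(pred_phase, true_phase))
    return wrong / len(true_phase)


# Two overlapping 2 x 2 squares shifted by one pixel.
mask_a = {(0, 0), (0, 1), (1, 0), (1, 1)}
mask_b = {(0, 1), (1, 1), (0, 2), (1, 2)}
dice_ab = dice(mask_a, mask_b)
hd_ab = hausdorff(mask_a, mask_b)
```

Dice measures region overlap while the Hausdorff distance is driven by the worst boundary point, so the two metrics can rank methods differently.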
We report the performance of our model below including performance of LV segmentation path and performance of LV quantification path.
Performance of LV segmentation path. Segmentation is one of our tasks, and the segmentation path also serves as a structural feature extractor for the regression path. To verify that the segmentation path can aggregate representative structural information and output predictions that closely resemble the correct results, we use the Dice and HD metrics to compare our proposed segmentation model with classic segmentation methods including UNet 49, Densenet 50, IndicesNet 38, MC-Seg 51, DRUNet 7, Parallel 52 and SAUNet 53. The Dice coefficient and HD are reported in Tables 1 and 2. We can conclude that every method shows competitive performance, that our proposed method outperforms the other methods in LV myocardium segmentation, and that the segmentation performance for the LV cavity is better than that for the LV myocardium. The LV cavity and LV myocardium are the regions of interest; they undergo shape variation during a cardiac systolic-diastolic cycle and across different subjects. It is difficult to recognize these two classes, especially the LV myocardium.
The qualitative segmentation predictions of our framework are shown in Fig. 5. The first row shows the input images and their corresponding ground truth, and the second row shows the predictions of the network. The third row shows the error between the ground truth and the segmentation prediction, where blue regions denote over-segmentation and red regions indicate under-segmentation.
We also reproduce classic networks to study the segmentation results. Figures 6 and 7 show the analysis on the LV-Quan validation dataset and the MICCAI 2009 Sunnybrook Cardiac left ventricle segmentation (LV-09) dataset, comparing our method with other classic semantic segmentation networks such as UNet 49, Densenet 50 and IndicesNet 38. Each model is trained for 500 epochs with a batch size of 20, supervised by the same loss function and initialized with the same CNN weights. The selected models share the same hyper-parameter configuration and the same training strategy. The LV-09 dataset contains 45 cardiac cine-MR short-axis (SAX) images from four different pathological groups. Each patient has manually drawn LV endocardium contours for the ED and ES slices 54. In this study, we segment the endocardium as a binary boundary to distinguish the anatomical structure of the LV from the background. In Fig. 6, for the second to fifth rows, the Dice coefficient of LV cavity segmentation is 0.940, 0.954, 0.960 and 0.971, and that of LV myocardium segmentation is 0.869, 0.871, 0.880 and 0.921, respectively. Figure 7 [...] unified framework is also a time-efficient approach with competitive segmentation performance. Moreover, the number of parameters and the GPU memory usage are the highest for IndicesNet and the lowest for UNet. Compared with IndicesNet, Densenet uses relatively few parameters and little GPU memory and is more time-efficient. Since our unified framework contains more convolutions and channels, our model has more parameters than UNet and Densenet, but its GPU memory usage is still relatively small compared with IndicesNet.
Performance of LV quantification path. Quantification of the LV indices is the ultimate purpose of our work. We compare our method with existing advanced methods (Max Flow 55, MultiFeatures 29, SDL 56, Indices-Net 38, FullLVNet 57, DMTRL 39, Indices-JSQ 6 and DRUNet 7) to evaluate its performance. We also add a comparative model to explore the performance difference between a segmentation-based model and our task-unified model. The comparative model is a direct morphological calculation method (Calculation), which computes the indices directly from the segmentation results without any regression network. The Calculation model computes the two area indices (Area-myocardium and Area-cavity) by counting the number of pixels enclosed by the endocardium and epicardium, respectively. It computes the three Dim indices by casting lines from the centroid of the LV cavity in the IS-AL, I-A and IL-AS directions and measuring the distance between the intersections of each line with the LV endocardium contour. It computes the regional wall thicknesses IS, I, IL, AL, A and AS by casting lines in six directions and measuring the distance between the intersections of each line with the myocardium. The performance is reported in Table 4. Max Flow is a multi-step model based on the indirect-segmentation method, in which the LV quantification indices are calculated from the LV myocardial contour segmented first. The Max Flow method has a high MAE for the LV regional wall thicknesses, but its PCC is better than that of some direct methods. The reason is that this method calculates the LV indices from the extracted contour, which results in a better mapping to the labels. The Calculation method is also an indirect-segmentation method, and it yields poor MAE and PCC compared with direct-regression methods. Multi-features and SDL are two-step direct-regression methods: they learn the cardiac image features first and then use the representative features to quantify the LV indices.
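The Calculation baseline described above can be sketched on a toy label mask. The label encoding, the ray-marching step and the helper names are assumptions made for illustration; a horizontal direction stands in for the anatomical directions, and wall thickness is omitted for brevity.

```python
import math

BACKGROUND, MYOCARDIUM, CAVITY = 0, 1, 2  # assumed label encoding


def label_area(mask, label):
    """Area of one structure as a pixel count."""
    return sum(row.count(label) for row in mask)


def centroid(mask, label):
    """Mean (row, column) position of all pixels carrying `label`."""
    points = [(r, c) for r, row in enumerate(mask)
              for c, value in enumerate(row) if value == label]
    return (sum(r for r, _ in points) / len(points),
            sum(c for _, c in points) / len(points))


def cavity_dimension(mask, angle_deg, step=0.25):
    """Approximate chord length (pixels) through the cavity centroid.

    A line is marched from the centroid in both directions until it leaves
    the cavity; the result is accurate to about one step per direction.
    """
    r0, c0 = centroid(mask, CAVITY)
    dr = math.sin(math.radians(angle_deg))
    dc = math.cos(math.radians(angle_deg))
    length = 0.0
    for sign in (1, -1):
        t = 0.0
        while True:
            # Nearest pixel to the next sample point along the line.
            r = math.floor(r0 + sign * (t + step) * dr + 0.5)
            c = math.floor(c0 + sign * (t + step) * dc + 0.5)
            inside = 0 <= r < len(mask) and 0 <= c < len(mask[0])
            if not inside or mask[r][c] != CAVITY:
                break
            t += step
        length += t
    return length


# Toy 7 x 7 mask: a 3 x 3 cavity surrounded by a one-pixel myocardium ring.
mask = [[BACKGROUND] * 7 for _ in range(7)]
for r in range(1, 6):
    for c in range(1, 6):
        mask[r][c] = CAVITY if 2 <= r <= 4 and 2 <= c <= 4 else MYOCARDIUM

cavity_area = label_area(mask, CAVITY)
myocardium_area = label_area(mask, MYOCARDIUM)
horizontal_dim = cavity_dimension(mask, angle_deg=0.0)
```

Because every value is read off the predicted mask, a single mislabeled boundary pixel changes the measured length or area directly, which illustrates why such a calculation inherits segmentation errors.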
From Table 4, we can conclude that the two-step direct-regression methods perform poorly, with both a high MAE and a low correlation with the ground truth. The poor representation ability of the two-step methods result [...] 16 mm for area, cavity dimension, and regional wall thickness. The average PCC values of area, cavity dimension, and regional wall thickness are 0.962, 0.978 and 0.872, respectively. Our task-unified network works in an end-to-end manner and incorporates the advantages of indirect and direct methods: the segmentation predictions are improved under the supervision of the label indices, and more accurate LV quantification indices are generated. We evaluate our unified framework on the testing data. Figures 8 and 9 show the normalized results of the quantification indices. The values of RWT, dimensions and areas are normalized. Figure 10 compares the clinical metrics of a randomly selected subject with the quantification metrics predicted by our task-unified model. In every panel, the dotted line is the quantification predicted by our task-unified model, and the solid line shows the ground-truth metrics. The predicted clinical metrics are very close to the annotated labels. The three figures illustrate that our unified network achieves competitive performance in LV indices quantification.
To better understand the feature extraction ability of the transformer, we conduct an ablation study with two models and visualize their predictions and output probability maps in Fig. 11. One model is our proposed unified network; the other is a simplified version of our model with the transformer block removed. The predictions and output probability maps of the simplified model are more blurred, whereas the segmentation path with the transformer module predicts more concise results. This demonstrates the feature extraction ability of the transformer.

Conclusions
In this study, we introduce an accurate and efficient unified deep learning network for segmentation and regression that segments and quantifies the LV simultaneously. The segmentation module leverages a U-Net-like 3D Transformer model to predict the contours of three anatomical structures, while the regression module learns spatial-temporal representations from the original images and the reconstructed feature maps from the segmentation path to estimate the desired quantification metrics. The three anatomical structures are the LV cavity, the LV myocardium and the background. The quantification metrics include the LV myocardial RWTs, the dimensions, the cavity and myocardium areas, and the cardiac phase (diastolic or systolic). We use a joint-task loss function to supervise the training of the two modules. Although the LV anatomical shape and appearance are highly variable across subjects, our model achieves competitive performance in both segmentation and quantification. The unified network was evaluated on the MICCAI 2017 LV-Quan dataset, and the experimental results demonstrate the accuracy and efficiency of our model. In the future, we will verify our framework on more datasets to test its contribution to clinical practice.

Figure 1. Two categories of methods exist in the left ventricular quantification domain. (a) Segmentation-based methods compute indices from the segmented result, which requires strong prior information and user interaction. (b) Existing direct regression methods for cardiac indices quantification. When labeled images are not available, direct methods (regression without a segmentation step) have grown in popularity for cardiac LV indices estimation.

Figure 3. Overview of the proposed unified framework, which contains a segmentation path and a regression path.

Figure 4. After pre-processing, we stack the input image and its corresponding ground truth to highlight the LV cavity and LV myocardium. The figure shows the variation in shape, contrast and density in cardiac MRI, which poses a great challenge for segmentation and quantification.

Figure 5. Example predictions of our model for 20 random frames. Rows (a) and (d) show the input images and the corresponding myocardium; rows (b) and (e) show the myocardium predictions of our unified network. Rows (c) and (f) show the difference in the target myocardial area between the ground truth and the predicted segmentation, where red marks under-segmented regions and blue marks over-segmented regions.

Figure 6. Example of comparative segmentation results on the LV-Quan validation data. The first row shows the raw input images and their corresponding ground truth. The second to fifth rows show the predictions of the comparative ablation models.

Figure 7. Example of comparative segmentation results on the LV-09 testing dataset. The rows from top to bottom show image slices from four pathological groups: Heart Failure with Ischemia (HF-I), Heart Failure without Ischemia (HF-NI), Hypertrophic endocardium (HYP) and Normal (NOR). The columns from left to right show the predictions of the comparative methods, the ground truth and the raw image data, respectively.

Table 3. [Caption and body not recovered; surviving column headers: … time (s), Training time (h), Testing time (s), Params (M), GPU memory (G).]

Figure 8. LV quantification indices predicted by our method on the testing data.

Figure 11. Visualization of the predictions and output probability maps. The rows marked "W/O transformer" show the model without the transformer module, while the rows marked "with transformer" show the model with the transformer module.

Table 1. Dice scores for LV-Quan segmentation performance. Significant values are in bold.

Table 2. Hausdorff distance for LV-Quan segmentation performance. Significant values are in bold.

Table 4. Comparison of quantification performance with state-of-the-art methods. MAE and PCC are shown in the table.