Introduction

Two-dimensional transthoracic echocardiography is widely used as a non-invasive, radiation-free, low-cost, real-time imaging modality for assessing cardiac function and diagnosing cardiovascular disease1. Cardiac images acquired at specific probe positions and angles are referred to as standard views, which together provide comprehensive information on cardiac structure and function2. High-quality standard views are the basis for reliable cardiac parameter measurements and accurate diagnosis3,4. However, compared with other imaging modalities, the acquisition of clinical echocardiographic images is less automated: echocardiographers manually adjust probe positions and parameters, subjectively recognize individual standard views, and select high-quality image frames5. The entire process is cumbersome, time-consuming, highly dependent on the echocardiographer's experience and maneuvers, and prone to inter- and intra-observer variability6,7,8. When the acquired views are of poor quality or key views are missing, interpretation by human experts or artificial intelligence (AI) models is compromised. Current automated diagnostic frameworks for cardiac diseases do not yet incorporate image quality control (QC) into the analysis process, necessitating manual pre-filtering of low-quality images and thereby limiting clinical applicability9. Ideally, real-time QC should be implemented during image acquisition so that optimal images are captured within the same examination. Therefore, performing automatic view recognition and quality assessment during image acquisition or before downstream image analysis tasks is crucial: the former directly produces a high-quality image base, while the latter screens the optimal images for subsequent analysis.

Recently, deep learning has been widely used in echocardiographic image analysis, enabling automated clinical workflows for view classification, cardiac structure extraction, cardiac function quantification, and cardiac disease diagnosis10,11,12,13. View classification is the first step in echocardiographic analysis. Previous studies have proposed view classification models based on convolutional neural networks (CNNs), such as VGGNet and ResNet, achieving good recognition performance14,15,16,17. Ongoing research on image quality assessment mainly targets natural images and focuses on the various distortions introduced during acquisition, compression, storage, and transmission18,19. Noise and artifacts are common in ultrasound images owing to the coherent interference of scattered waves20. However, unlike natural images, noisy ultrasound images are not always of low quality. The quality assessment of ultrasound images must also consider clinical practice requirements, emphasizing the visibility and integrity of specific anatomical structures21. Several studies have implemented quality assessment of echocardiographic images, broadly falling into three forms: classification confidence, quality level classification, and quality score regression. Huang et al.22 and Zhang et al.23 used the classification confidence of standard view recognition as the image quality score. However, classification confidence reflects the model's certainty in its prediction rather than the image quality level from a clinical practice perspective. Zamzmi et al.24 proposed a MobileNetV2-s-based encoder-decoder network to recognize four standard views and classify them into two quality levels (good or poor). However, continuous numerical score feedback to the operator during image acquisition is more helpful than discrete quality-level feedback. Abdi et al.25 categorized the quality of end-systolic apical 4-chamber (A4C) view frames into six levels (0 to 5) based on the visibility and clarity of anatomical structures and proposed a nine-layer CNN for score regression. Other studies were based on echocardiographic videos. Luong et al.26 set four quality score levels (0.25, 0.5, 0.75, and 1.0) based on the visibility of anatomical structures and combined DenseNet and LSTM networks to simultaneously achieve view recognition and quality assessment for nine echocardiographic videos. Labs et al.27 combined four convolutional layers and an LSTM network to assign scores ranging from zero to ten to four quality attributes in A4C and parasternal long axis (PLAX) videos. These studies provide effective standard view quality assessment methods; however, several limitations remain. First, current image quality assessment criteria are limited in scope and inadequate for clinically meaningful assessment of complex and diverse standard views. Second, most studies focus on a single task, with little research on a comprehensive multi-view classification and quality assessment pipeline. Third, traditional CNN architectures progressively abstract image information with deeper network layers, potentially losing the spatial details preserved in lower layers28.

In this study, we proposed four image quality attributes based on clinical practice needs and established corresponding evaluation criteria for six standard view categories. We developed a multi-task model that integrates view classification and image quality assessment into a unified framework. By sharing feature representations, multi-task learning enables multiple related tasks to be learned simultaneously in a single training process, effectively facilitating the exchange of information between tasks and thereby improving the overall performance and efficiency of the model29,30. Furthermore, we introduced the Feature Pyramid Network (FPN)31 into echocardiographic image quality assessment for the first time to achieve the fusion and utilization of multi-scale features.

Methods

An overview of the study is provided in Fig. 1. A multi-task deep learning model was trained on a dataset consisting of 170,311 echocardiographic images to automatically generate view categories and quality scores for clinical quality control workflows. This study conformed to the principles outlined in the Declaration of Helsinki and was approved by the Ethics Board of our institution (No. 2023–407).

Fig. 1

Overview of the study design. (a) Seven types of echocardiographic standard views were collected, including apical 4-chamber (A4C), parasternal view of the pulmonary artery (PSPA), parasternal long axis (PLAX), parasternal short axis at the mitral valve level (PSAX-MV), parasternal short axis at the papillary muscle level (PSAX-PM), parasternal short axis at the apical level (PSAX-AP), and other views (Others). (b) Four quality attributes were summarized, including overall contour, key anatomical structural details, standard view display, and image display parameter adjustments. (c) Model development workflow for data collection, data labeling, data preprocessing, and model training. (d) Two clinical application workflows: the left side shows real-time quality control during image acquisition, and the right side shows pre-stored image screening prior to AI-assisted diagnosis. Artwork attribution in (c) and (d): www.flaticon.com.

Data

This was a retrospective study. A large number of echocardiographic studies were randomly extracted from the picture archiving and communication system (PACS) of the Sichuan Provincial People's Hospital between 2015 and 2022 to establish the experimental dataset; all subjects were aged 18 years and above. Images showing severe cardiac malformations that prevented recognition of anatomical structures were excluded. The dataset consists of 107,311 echocardiographic images and includes six standard views commonly used in clinical practice: the A4C view, parasternal view of the pulmonary artery (PSPA), parasternal long axis (PLAX), parasternal short axis at the mitral valve level (PSAX-MV), parasternal short axis at the papillary muscle level (PSAX-PM), and parasternal short axis at the apical level (PSAX-AP). All remaining views were classified as “others”. For standard views with unevenly distributed quality levels, we performed undersampling to balance the data distribution. All images were acquired using ultrasound machines from multiple manufacturers, including Philips, GE, Siemens, and Mindray. The dataset was randomly divided into training (70%), validation (10%), and test (20%) sets through stratified sampling (Table 1). The distribution of quality scores for the three subsets is provided in Supplementary Figure S1 online.
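As an illustration of this splitting step, the following minimal sketch performs a stratified 70/10/20 split with scikit-learn; the DataFrame and its `view` column are hypothetical placeholders, not the study's actual data pipeline.

```python
# Minimal sketch of the stratified 70/10/20 split, assuming a pandas DataFrame
# `df` with a hypothetical `view` column as the stratification key; the actual
# keys and tooling used in the study may differ.
import pandas as pd
from sklearn.model_selection import train_test_split

def split_dataset(df: pd.DataFrame, seed: int = 42):
    # Hold out 20% as the test set, stratified by view.
    train_val, test = train_test_split(
        df, test_size=0.20, stratify=df["view"], random_state=seed)
    # Split the remaining 80% into 70% train / 10% validation
    # (0.125 of the remainder equals 10% of the full dataset).
    train, val = train_test_split(
        train_val, test_size=0.125, stratify=train_val["view"], random_state=seed)
    return train, val, test
```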

Table 1 Distribution of experimental dataset.

Quality scoring method

We established percentage-based scoring criteria for the different standard views based on four attributes: overall contour, key anatomical structural details, standard view display (see Supplementary Fig. S2 online for an example), and image display parameter adjustments. The four attributes contributed to the total score in a ratio of 3:4:2:1. Table 2 presents the scoring criteria for the PLAX view. Two accredited echocardiographers, each with at least five years of experience, independently annotated all images in the dataset, and the average of their annotations was used as the final expert score label. A third cardiology expert with over ten years of experience reviewed images whose two scores differed by more than 10 points. Images in the “others” category were assigned a score of zero for training purposes.
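To make the 3:4:2:1 weighting concrete, the sketch below computes a composite percentage score under the assumption that each attribute is itself rated on a 0–100 scale; the attribute names are illustrative, and Table 2 defines the actual per-view criteria.

```python
# Illustrative computation of a composite percentage score under the 3:4:2:1
# weighting. Assumes (hypothetically) that each attribute is itself rated on a
# 0-100 scale; the study's per-view rubrics define the actual criteria.
WEIGHTS = {
    "overall_contour": 0.3,
    "key_structure_details": 0.4,
    "standard_view_display": 0.2,
    "display_parameters": 0.1,
}

def composite_score(attribute_scores: dict) -> float:
    # Weighted sum of the four attribute scores.
    return sum(WEIGHTS[k] * attribute_scores[k] for k in WEIGHTS)

# Example: ratings of 90, 80, 70, and 100 give
# 0.3*90 + 0.4*80 + 0.2*70 + 0.1*100 = 83 points.
```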

Table 2 PLAX view scoring definition.

Model development

The model architecture is shown in Fig. 2 and mainly consists of a backbone network, a neck network, and two branch modules for view classification and quality assessment. The backbone network learns and extracts multi-scale image features. We chose the output feature maps \(\left\{{S}_{2},{S}_{3},{S}_{4},{S}_{5}\right\}\) (with output sizes of 1/4, 1/8, 1/16, and 1/32 of the original resolution, respectively) from the last four stages of the backbone network as the input to the neck network. To identify the best backbone network, we compared six deep CNN architectures, namely MobileNetV332, DenseNet12133, VGG1634, EfficientNet35, ResNet5036, and ConvNeXt37, and selected VGG16.
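A minimal sketch of this multi-scale feature tapping is shown below, assuming the torchvision VGG16 implementation; the layer indices and pretrained weights are illustrative assumptions rather than the authors' exact configuration.

```python
# Sketch of tapping the last four stages of a VGG16 backbone with torchvision's
# feature-extraction utility. Layer indices and weights are assumptions for
# illustration, not the authors' released configuration.
import torch
from torchvision.models import vgg16
from torchvision.models.feature_extraction import create_feature_extractor

backbone = vgg16(weights="IMAGENET1K_V1")
# Outputs of the last four max-pooling stages: 1/4, 1/8, 1/16, 1/32 resolution (S2-S5).
return_nodes = {"features.9": "s2", "features.16": "s3",
                "features.23": "s4", "features.30": "s5"}
extractor = create_feature_extractor(backbone, return_nodes=return_nodes)

x = torch.randn(1, 3, 224, 224)                    # one resized echo frame
feats = extractor(x)
print({k: tuple(v.shape) for k, v in feats.items()})
# s2: (1, 128, 56, 56), s3: (1, 256, 28, 28), s4: (1, 512, 14, 14), s5: (1, 512, 7, 7)
```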

Fig. 2

Proposed multi-task model architecture. The model consists of a backbone network, a neck network, a view classification branch, and a quality assessment branch. A single-frame image is input into the backbone network to extract features. The neck network then enhances and fuses these multi-scale features. The highest-level feature from the neck network is fed into the view classification branch to predict the view class, while the fused multi-scale feature is input into the quality assessment branch to generate a quality score.

The neck network serves as an intermediate feature layer that further processes and fuses the features extracted by the backbone network for the two subsequent tasks. The highest-level feature, \({S}_{5}\), is a more discriminative high-level semantic feature that reflects the network's understanding of the overall image context and is therefore suited to the classification task. Lv et al.38 showed that applying self-attention to high-level features with richer semantic concepts can capture the connections between conceptual entities in an image. Therefore, to further enhance the expressiveness of the features, we input \({S}_{5}\) into a Vision Transformer Block (VTB)39 that combines a multi-head attention layer and a feedforward layer to facilitate intra-scale feature interaction. The resulting feature map, denoted \({S}_{5}{\prime}\), is used for view classification. Subsequently, the FPN fuses the features of the four scales \(\left\{{S}_{5}{\prime},{S}_{4},{S}_{3},{S}_{2}\right\}\) layer by layer from the top down for cross-scale feature interaction. We denote the set of feature maps output from the FPN as \(\left\{{P}_{5},{P}_{4},{P}_{3},{P}_{2}\right\}\), each of which carries strong semantic information. Next, we fuse the feature maps of all scales using an Adaptive Feature Fusion Block (AFFB) to better model image quality perception. As shown in Fig. 3, the AFFB module first upsamples the feature maps at the different scales to the size of \({P}_{2}\) and then concatenates them. Channel attention is then computed using a Squeeze-and-Excitation Block40 to adaptively adjust the importance of each channel. Finally, element-wise addition is performed on the features from each scale to generate the final fused feature map \(\text{F}\), which is used for the quality assessment task; a code sketch of this block is given after Fig. 3.

Fig. 3

Adaptive Feature Fusion Block. The block integrates channel attention mechanisms to adaptively fuse feature outputs from the feature pyramid network at four scales, generating final quality-aware features for quality assessment.
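The following is a minimal PyTorch sketch of the AFFB as described above, assuming 256-channel FPN outputs and a reduction ratio of 16 in the SE block; these values, and the split-then-sum step after the channel attention, are assumptions based on the text rather than the authors' released code.

```python
# Minimal sketch of the Adaptive Feature Fusion Block (AFFB): upsample the FPN
# outputs to the P2 resolution, concatenate, reweight channels with a
# Squeeze-and-Excitation block, then sum the per-scale features element-wise.
# Channel count and reduction ratio are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))            # squeeze: global average pool
        return x * w[:, :, None, None]             # excite: channel reweighting

class AFFB(nn.Module):
    def __init__(self, channels: int = 256, num_scales: int = 4):
        super().__init__()
        self.num_scales = num_scales
        self.se = SEBlock(channels * num_scales)

    def forward(self, pyramid):                    # [P2, P3, P4, P5], finest first
        size = pyramid[0].shape[-2:]
        ups = [F.interpolate(p, size=size, mode="bilinear", align_corners=False)
               for p in pyramid]
        fused = self.se(torch.cat(ups, dim=1))     # channel attention on the concat
        # split back into the four scales and add element-wise to obtain F
        return torch.stack(fused.chunk(self.num_scales, dim=1), dim=0).sum(dim=0)
```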

For the view classification branch (VCB), a linear classifier generates the view classification result. In parallel, a projection head maps the features to a fixed dimension for computing the supervised contrastive loss41. The goal of supervised contrastive learning is to pull features of the same class closer together in the feature space while pushing features of different classes apart; by applying this loss, we aimed to mitigate the problem of small inter-class differences in echocardiographic images. For the quality assessment branch (QAB), global average pooling is performed on the feature map F to produce a K-dimensional feature vector, which is then fed into a multilayer perceptron (MLP) to regress the final image quality score.
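The two branch heads can be sketched as follows; the feature widths, projection dimension, and hidden size are illustrative placeholders rather than the paper's reported values.

```python
# Hedged sketch of the two branch heads; all dimensions are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ViewClassificationBranch(nn.Module):
    def __init__(self, in_dim: int = 512, num_classes: int = 7, proj_dim: int = 128):
        super().__init__()
        self.classifier = nn.Linear(in_dim, num_classes)        # view logits
        self.projector = nn.Sequential(                          # features for SupCon loss
            nn.Linear(in_dim, in_dim), nn.ReLU(inplace=True),
            nn.Linear(in_dim, proj_dim))

    def forward(self, s5_prime):                                 # [B, C, H, W]
        v = s5_prime.mean(dim=(2, 3))                            # global average pool
        return self.classifier(v), F.normalize(self.projector(v), dim=1)

class QualityAssessmentBranch(nn.Module):
    def __init__(self, in_dim: int = 256, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 1))                                # scalar quality score

    def forward(self, fused):                                    # fused map F: [B, K, H, W]
        return self.mlp(fused.mean(dim=(2, 3))).squeeze(1)       # [B]
```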

Model training

We jointly trained the model using the cross-entropy loss (for the view classification task), supervised contrastive loss, and mean squared error loss (for the quality assessment task). Additionally, to address the imbalance problem in multi-task training, an auto-tuning strategy42 was applied to learn the relative loss weights for each task.
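One widely used auto-tuning scheme learns a log-variance weight per task; whether this matches the strategy cited as ref. 42 is an assumption, but the sketch below illustrates the principle of learning relative loss weights jointly with the model parameters.

```python
# Sketch of a learnable log-variance weighting for the three task losses
# (cross-entropy, supervised contrastive, MSE). Illustrative only; the exact
# auto-tuning strategy of ref. 42 may differ.
import torch
import torch.nn as nn

class AutoWeightedLoss(nn.Module):
    def __init__(self, num_tasks: int = 3):
        super().__init__()
        # one learnable log(sigma^2) per task
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, losses):
        total = 0.0
        for i, loss in enumerate(losses):
            precision = torch.exp(-self.log_vars[i])
            total = total + precision * loss + self.log_vars[i]
        return total

# usage: total_loss = auto_loss([ce_loss, supcon_loss, mse_loss])
```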

The model was implemented in Python v3.8.12 using PyTorch v1.12.0 and trained on two NVIDIA GeForce RTX 3090 GPUs, each with 24 GB of memory. During training, the initial learning rate was set to 1e-5 and the batch size to 128. The Adam optimizer with a weight decay of 1e-5 was used. Input images were resized to 224 × 224, and pixel values were normalized to the range 0 to 1. No data augmentation was performed, to avoid altering image quality. An early-stopping strategy was used to reduce overfitting. The model that performed best on the validation set was applied to the test set to evaluate performance.
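A brief configuration sketch of this setup is shown below; the hyperparameters come from the text, while the model instance is a stand-in (a bare torchvision VGG16) rather than the full multi-task network.

```python
# Training-configuration sketch matching the description above.
import torch
from torchvision import models, transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),   # resize only; no augmentation is applied
    transforms.ToTensor(),           # scales pixel values to the range [0, 1]
])

model = models.vgg16()               # placeholder for the multi-task model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5, weight_decay=1e-5)
```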

Evaluation metrics

Five performance metrics, accuracy (ACC), precision (PRE), sensitivity (SEN), specificity (SPE), and F1 score (F1), were used to evaluate view classification performance, and a confusion matrix was constructed to analyze the classification of the different views. For quality assessment, Pearson's linear correlation coefficient (PLCC), Spearman's rank-order correlation coefficient (SROCC), mean absolute error (MAE), and root mean square error (RMSE) were used as evaluation indices. The number of model parameters and the inference time were also considered to comprehensively evaluate model performance. The Kruskal-Wallis test was employed to assess differences among independent groups, with p < 0.05 considered statistically significant; for multiple comparisons, Dunn's tests with Bonferroni correction were applied. Bootstrap resampling was used to calculate 95% confidence intervals. Statistical analyses were conducted using SPSS v27.0 or Python v3.8.12.
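The quality-assessment metrics can be computed as in the following sketch; the array names are illustrative.

```python
# Sketch of the quality-assessment metrics between predicted and expert scores.
import numpy as np
from scipy import stats

def quality_metrics(pred: np.ndarray, target: np.ndarray) -> dict:
    plcc, _ = stats.pearsonr(pred, target)      # Pearson linear correlation
    srocc, _ = stats.spearmanr(pred, target)    # Spearman rank-order correlation
    mae = float(np.mean(np.abs(pred - target)))
    rmse = float(np.sqrt(np.mean((pred - target) ** 2)))
    return {"PLCC": plcc, "SROCC": srocc, "MAE": mae, "RMSE": rmse}
```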

Results

Evaluation of view classification task

The overall accuracy of the view classification task on the test set was 97.8% (95% CI, 97.7–98.0), with macro-averaged PRE, SEN, SPE, and F1 all exceeding 94.8% (Table 3). The confusion matrix in Fig. 4 indicates that the model is most prone to confusion among the three parasternal short axis views. The Grad-CAM maps in Fig. 5 reveal the image regions on which the model focuses when making classification decisions. For the A4C view, the model focuses on the mitral valve, tricuspid valve, ventricular septum, and atrial septum; for the PLAX view, on the aortic and mitral valves; and for the PSPA view, on the pulmonary artery wall and pulmonary valve. Additionally, for the three similar parasternal short axis views, the model effectively focuses on key anatomical structural details, including the fish-mouth-shaped mitral valve orifice (mitral valve level), the two sets of strongly echogenic papillary muscles (papillary muscle level), and the annular left ventricular wall structure in three planes (apical level). To further demonstrate the robustness of the model, Grad-CAM maps after data augmentation and under pathological conditions are provided in Supplementary Figures S3 and S4 online, respectively.

Table 3 Performance of the view classification task.
Fig. 4

Confusion matrix between different views. The Y-axis of the confusion matrix shows the true labels and the X-axis shows the labels predicted by the model. The two numbers in each cell indicate the sample count and its percentage within the corresponding true-label row.

Fig. 5

Grad-CAM maps for different views. Different colors indicate the activation level in different image areas. Red areas indicate high activation, while blue areas indicate low activation.

Evaluation of quality assessment task

The results of the quality assessment task on the test set are presented in Table 4. The average PLCC and SROCC values were 0.898 (95% CI, 0.893–0.902) and 0.893 (95% CI, 0.888–0.897), respectively, indicating a strong correlation between the model-predicted scores and the experts' subjective scores. The average MAE and RMSE values were 6.54 (95% CI, 6.43–6.66) and 9.42 (95% CI, 9.24–9.60), respectively, which is acceptable relative to the 0–100 label range. Example predictions for each view are shown in Fig. 6, where image quality clearly improves as the score increases.

Table 4 Performance of the quality assessment task.
Fig. 6

Examples of the test results for the six standard views. The orange value in each panel is the expert score, and the green value is the score predicted by the proposed method.

Effect of different backbones and additional modules on the proposed method

The performance of the proposed method with different backbone networks is presented in Table 5. Compared with the other CNNs, VGG16 achieved the best trade-off among accuracy, number of parameters, and inference time. To analyze the effectiveness of each module in the proposed method, ablation experiments were conducted using the VGG16-based quality assessment model (single task) as the baseline. As shown in Table 6, with the sequential addition of the neck network, the view classification task, and the supervised contrastive loss, the performance of our model improved significantly. Furthermore, we compared model performance with and without the "others" view included.

Table 5 Comparison of different backbone networks with our method.
Table 6 Improvement of the baseline with additional modules.

Application of the proposed method for echocardiographic image quality analysis

To verify the feasibility of the proposed method for standard view quality assessment, we compared archived image quality among three groups of echocardiographers with different levels of experience (3 junior, 3 senior, and 3 expert). The junior group had 1–2 years of experience, the senior group 4–5 years, and the expert group over 10 years. The distribution of ultrasound machine manufacturers among the three groups was relatively balanced. We hypothesized that image quality in the expert group would be higher than in the other groups. Images collected by the nine echocardiographers between July and December 2023 were scored using the proposed model. The subjects were males aged 18–40 years without obvious cardiac structural or functional abnormalities. Based on the predictions, 6000 images were randomly selected from each echocardiographer, comprising 1000 images per view type. In total, 54,000 images covering the six standard views were used for statistical analysis.

The Kruskal-Wallis test indicated a significant difference in quality scores across the three groups of echocardiographers for each view (p < 0.001). The box plot in Fig. 7 further illustrates the distribution of quality scores for the three groups across the six standard views. After adjusting for multiple comparisons, the median quality score of the expert group was higher than that of both the senior and junior groups for every view (p-adj < 0.001). Except for the PSPA view, the median quality score of the senior group was higher than that of the junior group (p-adj < 0.001).
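For reference, the group comparison for a single view can be reproduced along the lines of the sketch below; the scikit-posthocs package and the synthetic score arrays are assumptions for illustration (the study reports using SPSS or Python for these analyses).

```python
# Sketch of the group comparison for one view: Kruskal-Wallis across the three
# experience groups, then Dunn's post-hoc test with Bonferroni correction.
import numpy as np
from scipy import stats
import scikit_posthocs as sp

rng = np.random.default_rng(0)
junior = rng.normal(70, 10, 1000)    # placeholder quality scores for one view
senior = rng.normal(75, 10, 1000)
expert = rng.normal(80, 10, 1000)

h_stat, p_value = stats.kruskal(junior, senior, expert)
p_adj = sp.posthoc_dunn([junior, senior, expert], p_adjust="bonferroni")
print(p_value, "\n", p_adj)          # pairwise Bonferroni-adjusted p-values
```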

Fig. 7

Distribution of quality scores by view and group. The box plot visualizes the distribution differences among the three groups (junior, senior, and expert), showing the minimum, first quartile, median, third quartile, and maximum.

Discussion

In this study, we developed and validated a multi-task model that simultaneously performs view recognition and percentage-scale image quality assessment for seven types of views. The rationale for integrating the two tasks into a single model is that they are interrelated: the view type determines the focus of the quality assessment. The ablation experiments show that it is feasible to train a generic model to extract features from different echocardiographic views for quality assessment. Furthermore, introducing view classification as an auxiliary task provides additional support for feature learning, which improves quality assessment performance. The model performs well on both tasks. For view classification, misclassifications occurred mainly among the three parasternal short axis views, attributable to the high similarity of their anatomical structures; however, guided by the supervised contrastive loss to learn distinctive feature representations, the model still achieved relatively accurate recognition. For quality assessment, the results show that the proposed model, by incorporating quality-aware features at different scales, effectively learned the judgment criteria used by human experts in echocardiographic image quality assessment. Even for the PSAX-AP view, which had a small sample size, the proposed model achieved acceptable results.

Compared with previous methods, our study has several strengths. For quality assessment, we summarized four clinically significant quality attributes, ensuring that image quality scores are closely aligned with diagnostic value. We applied the model to analyze archived images from echocardiographers of varying experience levels and confirmed that those with more experience produced higher-quality images, demonstrating the model's ability to analyze image quality from a clinical diagnostic perspective. Regarding model design, prior methods focused solely on high-level single-scale features and overlooked low-level details. To address this issue, we added a hierarchical neck network that performs multi-scale perception modeling at a low computational cost, simulating the human visual system's hierarchical processing of visual stimuli at different scales. The results indicate that the quality assessment task benefited significantly from adaptively integrating high-level semantic information with low-level detail through the neck network. From a clinical application perspective, our study operates on single echocardiographic images, making it more pertinent to real-world practice than video-based studies. The proposed model evaluates the static image captured at each moment, avoiding the complexity and computational cost of processing dynamic video data with a 2D + t model. Additionally, our dataset encompasses six standard views and groups all other views into an "others" category, enabling the model to directly differentiate the six target views from all other views. The results show that introducing the "others" view increases data diversity and improves view classification, with only a minor compromise in quality assessment performance. In contrast, the model proposed by Luong et al.26 does not include an "others" category and relies solely on a confidence threshold to classify images: images with confidence below the threshold are assigned to "others", while those above the threshold are assigned to one of the target views. However, because lower-quality target views may also exhibit lower confidence, setting an effective threshold to distinguish them from "others" is difficult. Particularly in AI-assisted diagnosis, misclassifying other views as the required standard views can significantly affect diagnostic accuracy.

The proposed multi-task model reduces deployment overhead by merging the two tasks, achieving a good trade-off between accuracy and inference time. Deployed on an RTX 3090 GPU, the model required no more than 2.8 ms to process a 224 × 224 frame. The model can be developed into part of a QC system applicable to several clinical scenarios. In echocardiography training, immediate feedback on view type and quality score can help novice operators master the technical essentials of echocardiography more quickly and alleviate the shortage of teaching faculty in underdeveloped areas43. During echocardiographic examinations, the system can assist operators in standardizing imaging and monitoring the progress of view acquisition, reducing measurement variability and improving diagnostic quality44,45. Furthermore, the system can perform post-hoc analysis of large-scale stored images or serve as a preprocessing step for AI-assisted diagnostic systems, selecting high-quality, interpretable cardiac ultrasound images from pre-stored data.

Our study has several limitations. First, the standard views covered are limited; commonly used apical views, such as the apical 2-chamber and apical 3-chamber, should be incorporated in future work. Because our method does not impose specific constraints on view selection, it can in principle accommodate additional standard echocardiographic views. Second, our method generates only an overall quality score for each image; individual scoring of the different quality attributes should be explored in future research. Third, although the model was developed on a diverse dataset, its robustness and reliability require further validation in real-world clinical settings.