Impact of fully automated assessment on interstudy reproducibility of biventricular volumes and function in cardiac magnetic resonance imaging

Cardiovascular magnetic resonance (CMR) imaging provides reliable assessments of biventricular morphology and function. Since manual post-processing is time-consuming and prone to observer variability, efforts have been directed towards novel artificial intelligence-based fully automated analyses. Hence, we sought to investigate the impact of artificial intelligence-based fully automated assessments on the inter-study variability of biventricular volumes and function. Eighteen participants (11 with normal function, 3 with heart failure with preserved and 4 with reduced ejection fraction (EF)) underwent serial CMR imaging at a median interval of 63 days (range 49–87). Short axis cine stacks were acquired for the evaluation of left ventricular (LV) mass, LV and right ventricular (RV) end-diastolic, end-systolic and stroke volumes as well as EF. Assessments were performed manually (QMass, Medis Medical Imaging Systems, Leiden, Netherlands) by an experienced reader (3 years) and an inexperienced reader (no active reporting, 45 min of training with five cases from the SCMR consensus data) as well as fully automated (suiteHEART, Neosoft, Pewaukee, WI, USA) without any manual corrections. Inter-study reproducibility was overall excellent with respect to LV volumetric indices, best for the experienced observer (intraclass correlation coefficient (ICC) > 0.98, coefficient of variation (CoV) < 9.6%), closely followed by automated analyses (ICC > 0.93, CoV < 12.4%) and lowest for the inexperienced observer (ICC > 0.86, CoV < 18.8%). Inter-study reproducibility of RV volumes was excellent for the experienced observer (ICC > 0.88, CoV < 10.7%) but considerably lower for automated and inexperienced manual analyses (ICC > 0.69 and > 0.46, CoV < 22.8% and < 28.7%, respectively). In this cohort, fully automated analyses allowed reliable serial investigations of LV volumes with comparable inter-study reproducibility to manual analyses performed by an experienced CMR observer. In contrast, automated RV quantification with current algorithms still required manual post-processing to be reliable.

www.nature.com/scientificreports/
to observer variability [7][8][9][10]. Despite efforts directed towards automation of volume and mass assessments, most approaches require manual preparation and preselection of CMR images 11,12. More recently, novel artificial intelligence (AI)-based deep learning algorithms were introduced which allow for fully automated post-processing of LV mass and biventricular volumes, showing promising initial results including risk stratification following acute myocardial infarction 13,14. Data on interstudy reproducibility are of high clinical importance when it comes to follow-up surveys. Observer experience and variability may significantly impact the identification of subtle clinical changes between exams 10. Hence, the current study aimed to assess the impact of fully automated assessments on inter-study variability and reliability in comparison to an experienced and an inexperienced observer to define the current potential and limitations of fully automated post-processing.

Methods
Study population. The study population consisted of 18 participants who were scanned twice at a median interval of 63 days (range 49-87) using a standardized imaging protocol for anatomy and function 15,16. All participants were in stable sinus rhythm during image acquisition. A minimum of 6 weeks between the first and second scan was required to avoid recollection bias of the involved CMR staff. Care was taken that acquisitions were performed at the same levels of the heart and that no change in symptoms or medication occurred in patients with heart failure. Furthermore, new onset of cardiac disease was excluded in healthy subjects. The study was approved by the Ethics Committee of the Charité-University Medicine Berlin and was conducted according to the principles of the Declaration of Helsinki. All participants gave written informed consent before randomization. The study was supported by the German Centre for Cardiovascular Research (DZHK).
Manual volumetric assessments were performed in short-axis (SA) orientations according to standardized recommendations 17 by an experienced CMR operator (observer A, cardiologist, 3 years of CMR experience) and an inexperienced operator (observer B, trainee in cardiology, no experience in reporting or CMR segmentation), who was trained for 45 min by the experienced observer with five cases from the SCMR consensus data 18. Long-axis views (4-chamber and 2-chamber) were crosslinked to define RV and LV basal segments. Dedicated commercially available post-processing software was employed for manual assessments (QMass, Version 3.1.16.0, Medis Medical Imaging Systems, Leiden, The Netherlands). Fully automated analyses were performed in SA stacks with suiteHEART (Version 4.0.6, Neosoft, Pewaukee, WI, USA), Fig. 1. Papillary muscles were included within the myocardium. Fully automated analyses were not manually post-processed or validated, and manual segmentations were not supported by any semi-automated processing, e.g. threshold or edge detection. All operators were blinded to their previous as well as each other's results. Volumetric analyses comprised LV mass, LV and RV end-diastolic and end-systolic volumes (EDV/ESV) as well as stroke volumes (SV) and EF. Interstudy agreements were evaluated for manual assessment of observer A, manual assessment of observer B as well as fully automated analyses.
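For orientation, SV and EF follow from the contoured volumes by the standard definitions SV = EDV − ESV and EF = SV / EDV × 100. The short sketch below only illustrates this arithmetic. The function name and the example volumes are hypothetical; they are not study data and do not reflect the internals of either software package.

```python
def derived_indices(edv_ml: float, esv_ml: float) -> tuple[float, float]:
    """Return (stroke volume in ml, ejection fraction in %) from EDV and ESV."""
    if edv_ml <= 0 or not 0 <= esv_ml <= edv_ml:
        raise ValueError("expected EDV > 0 and 0 <= ESV <= EDV")
    sv_ml = edv_ml - esv_ml               # stroke volume
    ef_percent = 100.0 * sv_ml / edv_ml   # ejection fraction
    return sv_ml, ef_percent


# Hypothetical example values (not study data)
sv, ef = derived_indices(edv_ml=150.0, esv_ml=60.0)
print(f"SV = {sv:.0f} ml, EF = {ef:.0f}%")  # SV = 90 ml, EF = 60%
```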
Statistical analyses. Statistics were calculated using IBM SPSS Version 24 for Windows (IBM, Armonk, NY, USA) and Microsoft Excel. Continuous parameters are reported as mean and corresponding standard deviation (SD); changes from Exam 1 to 2 were evaluated using the Wilcoxon signed-rank test for dependent continuous parameters. A p value of 0.05 or below was considered statistically significant. Inter-study and interobserver variability were assessed using intra-class correlation coefficients (ICC) based on absolute agreement (excellent ICC > 0.74, good between 0.60 and 0.74, fair between 0.40 and 0.59 and poor below 0.40) 19, the coefficient of variation (CoV, defined as the SD of the mean difference (MD) divided by the mean, SD(MD)/mean) as well as Bland-Altman plots [mean difference between measurements with 95% confidence interval (CI)] 20. Intra-observer reproducibility of the automated algorithm has been addressed previously, yielding ICC = 1 and CoV 0% 13. Sample sizes were calculated for the detection of absolute changes of 10 g LV mass, 10 ml LV and RV EDV/ESV/SV as well as 5% change in LV/RV-EF for a power of 80% and an α-error of 0.05 using the formula $n = f(\alpha, P) \cdot \sigma^2 \cdot 2 / \delta^2$, where n = sample size, f = factor taking α (level of significance) and P (study power) into account (f = 7.85 for α = 0.05 and P = 0.8), σ = interstudy standard deviation of the mean difference between Exam 1 and 2 and δ = the magnitude of differences to be detected 5,6.
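The measures described above can be sketched in a few lines of Python. This is a minimal illustration, not the SPSS/Excel workflow used in the study. It assumes that the CoV denominator is the mean of all measurements from both exams, that the Bland-Altman interval is the mean difference ± 1.96 SD, and that sample sizes are rounded up to the next integer. All function names and the example measurements are hypothetical.

```python
import math
import statistics


def coefficient_of_variation(exam1, exam2):
    """CoV in %: SD of the paired differences divided by the overall mean (assumed)."""
    diffs = [b - a for a, b in zip(exam1, exam2)]
    overall_mean = statistics.mean(list(exam1) + list(exam2))
    return 100.0 * statistics.stdev(diffs) / overall_mean


def bland_altman(exam1, exam2):
    """Return (mean difference, lower bound, upper bound) using +/- 1.96 SD (assumed)."""
    diffs = [b - a for a, b in zip(exam1, exam2)]
    md = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    return md, md - 1.96 * sd, md + 1.96 * sd


def classify_icc(icc):
    """Map an ICC value to the agreement categories quoted in the text."""
    if icc > 0.74:
        return "excellent"
    if icc >= 0.60:
        return "good"
    if icc >= 0.40:
        return "fair"
    return "poor"


def sample_size(sd_diff, delta, f=7.85):
    """n = f * sigma^2 * 2 / delta^2, rounded up (rounding is an assumption)."""
    return math.ceil(f * sd_diff ** 2 * 2 / delta ** 2)


# Hypothetical paired EDV measurements in ml (not study data)
exam1 = [140, 160, 180, 120]
exam2 = [144, 158, 186, 120]
print(coefficient_of_variation(exam1, exam2))
print(bland_altman(exam1, exam2))
print(classify_icc(0.93))
print(sample_size(sd_diff=5.6, delta=10))  # 5
```

Because n grows with the square of σ, doubling the interstudy SD roughly quadruples the number of participants needed to detect the same change δ.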

Results
Study population. The study population consisted of 18 participants, 11 with normal biventricular function and 7 with heart failure, the latter including 3 patients with heart failure with preserved (HFpEF) and 4 patients with reduced (HFrEF) ejection fraction. The mean age was 46 years (SD 23). Ten participants were male and 8 female. All SA stacks were assessed by observers A and B as well as by the fully automated software algorithm. Results for LV and RV volumes are reported in Table 1. LV volumes and function were not significantly different between exams 1 and 2 for observers A and B as well as for automated analyses. Statistically significant differences in RV volumetry were observed for observer A and the automated software algorithm (Table 1). Manual post-processing took on average 8.5 ± 1.7 min and 13.2 ± 2.8 min for the experienced and inexperienced observer, respectively, as opposed to automated analyses with < 1 min/SA stack.
Reproducibility. For interstudy reproducibility, mean differences as well as corresponding SD, ICC and CoV of LV and RV volumes are reported in Table 2; corresponding Bland-Altman plots are displayed in Figs. 2, 3 and 4. LV reproducibility was overall excellent (ICC 0.86-1.00), best for observer A (ICC > 0.98), followed by fully automated analyses (ICC > 0.93) and observer B (ICC > 0.86). Interstudy reproducibility of RV volumes was excellent for observer A (ICC > 0.88), good to excellent for automated analyses (ICC 0.69-0.92) and fair to excellent for observer B (ICC 0.46-0.95). Similarly, lowest interstudy variability was found in LV volumes for observer A (CoV < 9.6%) followed by fully automated analyses (CoV < 12.4%) and observer B (CoV < 18.8%). Regarding RV analyses, lowest interstudy variability was found for observer A (CoV < 10.7%) whilst fully automated analyses (CoV < 22.8%) as well as observer B (CoV < 28.7%) demonstrated considerable inter-study variability.
For interobserver reproducibility, mean differences as well as corresponding SD, ICC and CoV of LV and RV volumes are reported in Table S1 (supplementary material), comparing automated with experienced and inexperienced manual analyses as well as experienced with inexperienced manual analyses. Interobserver reproducibility was overall excellent for LV analyses (ICC 0.92-0.99) and fair to excellent for RV metrics (ICC 0.43-0.97). Fully automated LV analyses showed better agreement with experienced than with inexperienced analyses. The automated algorithm overestimated RV EDV (mean difference 12.9 ± 13.8 ml/m²) and RV ESV (mean difference 10.5 ± 12.7 ml/m²) as compared to the experienced observer, while it underestimated RV ESV (mean difference −14.8 ± 9.0 ml/m²) as compared to the inexperienced observer.
Sample size calculations. Sample sizes required for the detection of absolute changes in volumetric indices (10 g mass, 10 ml in volume or 5% EF) are reported in Table 3. Sample sizes were smallest for observer A, followed by fully automated analyses, and largest for observer B. Whilst sample sizes of automated analyses for LV volumes were similar to those of observer A, sample sizes of automated analyses for RV volumes were similar to those of observer B. LV volume sample sizes ranged between n = 5 for LV mass and n = 11 for ESV for observer A, between n = 6 for EF and n = 32 for EDV for automated analyses and between n = 19 for EF and n = 89 for EDV for observer B. RV volume sample sizes ranged between n = 6 for ESV and n = 9 for SV for observer A, between n = 27 for ESV and n = 77 for EDV for automated analyses and between n = 42 for ESV and n = 73 for SV for observer B.
Table 2. Interstudy reproducibility. Interstudy variability for manual (experienced and inexperienced observer) as well as for fully automated assessments. LV mass is reported in grams, volumes in ml and EF in %. SD standard deviation, ICC intraclass correlation coefficient, CoV coefficient of variation, LV left ventricular, RV right ventricular, EDV end-diastolic volume, ESV end-systolic volume, SV stroke volume, EF ejection fraction.
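To relate the sample sizes reported in this section to the underlying variability, the formula from the Methods can be inverted. The sketch below is a back-of-the-envelope illustration under the assumption that the reported n values were rounded up, so the result is an upper bound on the interstudy SD rather than the value in Table 2. The helper names are hypothetical.

```python
import math

F_FACTOR = 7.85  # f(alpha = 0.05, power = 0.8) as stated in the Methods


def implied_sd(n: int, delta: float, f: float = F_FACTOR) -> float:
    """Largest interstudy SD consistent with a reported n (assumes n was rounded up)."""
    return delta * math.sqrt(n / (2.0 * f))


def sd_ratio(n_a: int, n_b: int) -> float:
    """Approximate ratio of interstudy SDs implied by two sample sizes at equal delta."""
    return math.sqrt(n_a / n_b)


# LV mass, observer A: n = 5 for a 10 g change
print(round(implied_sd(n=5, delta=10.0), 2))  # 5.64 (g)

# LV EDV: observer B (n = 89) versus automated analyses (n = 32)
print(round(sd_ratio(89, 32), 2))  # 1.67
```

Read this way, the roughly threefold larger sample size of observer B for LV EDV corresponds to an interstudy SD only about 1.7 times that of the automated analyses, because n scales with σ².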

Discussion
The present study evaluates the interstudy variability of LV mass as well as LV and RV volumes quantified using a fully automated post-processing algorithm. Concerning LV analyses, the results demonstrate similarly high interstudy reproducibility of fully automated analyses as compared to an experienced CMR observer and show superior performance of fully automated analyses as compared to an inexperienced observer. In contrast, reliability of automated RV analyses is notably lower as compared to an experienced CMR observer. CMR imaging represents the reference standard for the assessment of cardiac morphology and function due to a precise evaluation of balanced steady-state free precession (bSSFP) SA stacks covering the entire LV and RV 1. However, in many departments CMR examinations are still not easily available since MR scanners are not always dedicated to CMR, and consequently examinations and post-processing of the images are relatively time-consuming compared to other examinations. As a result, cost-effectiveness is lower compared with competing methodology such as echocardiographic approaches, even though CMR diagnostic information can often be considered of higher value 7,9. Notwithstanding, mounting evidence emphasizes the need for CMR surveys in an increasing number of cardiac diseases 21. Experience and training are important for achieving high-quality diagnostic examinations, have a distinct effect on volumetric analyses and are particularly required in challenging anatomic conditions, e.g. patients with congenital heart disease 10,22. User-independent fully automated assessments have been introduced for the evaluation of biventricular volumes, showing promising results 11. Machine learning and AI-based algorithms 23 may indeed complement varying levels of user experience. Furthermore, process efficiency may be strengthened, considering that volumetric analyses of SA stacks may already be performed in parallel to scanning.
Recently, automated analyses demonstrated feasibility and equally predictive prognostic value in 1017 patients following acute myocardial infarction compared to conventional analyses by trained and experienced medical personnel 14. Several previous studies applying the proposed automated algorithm showed consistently high interobserver reproducibility with experienced CMR observers 13,14. The feasibility and reliability of automated LV analyses in clinical routine imaging is further underlined by the present data demonstrating high interstudy reproducibility. Future applications may expand to automated tissue characterisation, e.g. scar quantification 14, as well as deformation imaging 24. Deformation imaging has gained recognition for enhanced risk prediction beyond conventional volumetric derived functional analyses, e.g. following acute myocardial infarction 25 as well as in ischemic and non-ischemic cardiomyopathy 26. However, ongoing discussions about the reproducibility of deformation-based approaches 27 and limited data from large clinical trials still hamper its unrestricted clinical use. At the current time, cardiac volumetric analyses remain the gold standard for quantitative functional assessments, despite their inability to assess regional function. Guidelines for clinical decision making are inevitably based upon thresholds 28. In certain clinical scenarios, decision making heavily relies on changes between serial examinations, e.g. recovery of LVEF following acute myocardial infarction to evaluate implantable cardioverter defibrillator (ICD) therapy 3. Serial examinations rely on the assumption that changes in cardiac mass and volumes are reliably detectable. However, most CMR imaging laboratories employ several CMR operators, often with different training experience, resulting in potential inter-observer variability if serial CMR examinations are analysed by different observers.
This study confirms an overall excellent interstudy reproducibility for LV mass and volumes, best for manual assessments by an experienced observer and user-independent automated analyses and slightly lower for an inexperienced observer. Reproducibility of RV volumes was overall lower compared to LV metrics, which is in line with the available literature 6. Whilst the experienced observer still achieved good to excellent reproducibility, variability between exams was high for the inexperienced observer. Automated assessments of RV volumes resulted in a slight improvement of reproducibility as compared to the inexperienced observer. We observed numerical differences for RV volumetry both for manual and automated analysis between the repeated exams. Even though they were statistically significant, their respective clinical relevance with a change of 2% in RV-EF should be interpreted with caution. On the other hand, defined cut-offs (e.g. for arrhythmogenic right ventricular cardiomyopathy (ARVC), end-diastolic volumes beyond 110 ml/m² for male and 100 ml/m² for female patients or an EF below 40% 29) require precise volume assessments. Thus, inaccuracies in RV volume assessments bear potential clinical consequences. The present data support current evidence that precise and correct quantifications of RV metrics remain challenging and still require dedicated training, which is probably due to the more complex anatomy of the RV as compared to the LV 10,30. Because RV functional, but not structural, changes have been strongly linked to prognosis following acute myocardial infarction 31, automated RV assessment and the required refinement of the analysis warrant further investigation.

Limitations
Sample size calculations and derived conclusions are based on n = 18 participants. Although previous reports indicate that small sample sizes are sufficient for CMR volume assessments 5, statistical evaluations and generalisation may be limited. Detailed specifications of the automated algorithm, which incorporates AI and deep learning models developed by the manufacturer, are not disclosed and therefore cannot be described more precisely. The results of the study therefore apply to this specific cohort. Without knowing the exact types of scans used in the software's training, it might be difficult to extrapolate the results to other cohorts, which should be addressed in larger future studies. Furthermore, it will be interesting to address whether the results can be extrapolated to patients with a more demanding anatomy (e.g. patients with congenital heart disease).

Conclusion
In this cohort, fully automated user-independent analyses allowed reliable serial investigations of LV volumes and function with comparably high interstudy reproducibility in relation to manual analyses performed by an experienced CMR observer. In contrast, fully automated RV assessments did not yet provide satisfactory interstudy reproducibility and still require manual post-processing corrections by an experienced reader.
Table 3. Sample size calculation. Sample size calculation for repeated measurements of left and right ventricular volumes for the detection of changes amounting to 10 g mass, 10 ml volumes and 5% EF, respectively. EDV end-diastolic volume, ESV end-systolic volume, SV stroke volume, EF ejection fraction, SD standard deviation.