Non-inferiority of deep learning ischemic stroke segmentation on non-contrast CT within 16-hours compared to expert neuroradiologists

We determined if a convolutional neural network (CNN) deep learning model can accurately segment acute ischemic changes on non-contrast CT compared to neuroradiologists. Non-contrast CT (NCCT) examinations from 232 acute ischemic stroke patients who were enrolled in the DEFUSE 3 trial were included in this study. Three experienced neuroradiologists independently segmented hypodensity that reflected the ischemic core on each scan. The neuroradiologist with the most experience (expert A) served as the ground truth for deep learning model training. Two additional neuroradiologists’ (experts B and C) segmentations were used for data testing. The 232 studies were randomly split into training and test sets. The training set was further randomly divided into 5 folds with training and validation sets. A 3-dimensional CNN architecture was trained and optimized to predict the segmentations of expert A from NCCT. The performance of the model was assessed using a set of volume, overlap, and distance metrics using non-inferiority thresholds of 20%, 3 ml, and 3 mm, respectively. The optimized model trained on expert A was compared to test experts B and C. We used a one-sided Wilcoxon signed-rank test to test for the non-inferiority of the model-expert compared to the inter-expert agreement. The final model performance for the ischemic core segmentation task reached a performance of 0.46 ± 0.09 Surface Dice at Tolerance 5mm and 0.47 ± 0.13 Dice when trained on expert A. Compared to the two test neuroradiologists the model-expert agreement was non-inferior to the inter-expert agreement, p < 0.05.
The CNN accurately delineates the hypodense ischemic core on NCCT in acute ischemic stroke patients with an accuracy comparable to neuroradiologists.


Introduction
Acute ischemic stroke (AIS) is the number one cause of disability and a leading cause of mortality in the United States and worldwide 1,2. AIS due to large vessel occlusion (AIS-LVO) carries the worst prognosis, but timely endovascular thrombectomy treatment leads to reduced death and disability 3. AIS-LVO patient treatment decisions are guided by the presence and severity of the acute ischemic core, which is considered to be irreversibly injured [4][5][6]. The ischemic core is commonly assessed on computed tomography perfusion and diffusion-weighted imaging (DWI) magnetic resonance imaging (MRI). However, these imaging techniques are less widely available, and more generalizable means to identify and quantify the ischemic core on non-contrast head CT (NCCT) are needed. NCCT is the most commonly used imaging modality in AIS patients (>65%) given its widespread availability and low cost [7][8][9].
Established semi-quantitative methods to assess an ischemic stroke on NCCT include the European Cooperative Acute Stroke Study (ECASS) 1 and Alberta Stroke Program Early CT Score (ASPECTS). ECASS defined a major infarct as involving more than 1/3 of the middle cerebral artery territory 10, and ASPECTS evaluates 10 standardized regions within the middle cerebral artery territory and removes one point for the presence of hypodensity within each region. AIS-LVO patients with an ASPECTS ≥ 6 have been shown to benefit from thrombectomy in multiple studies 11,12, and, more recently, AIS-LVO patients with an ASPECTS ≥ 3 have also been shown to benefit from thrombectomy [13][14][15]. ASPECTS is widely used, but it is limited by low reproducibility among raters and correlates only modestly to ischemic lesion volumes and symptom severity 12,16,17. New imaging techniques that can identify and segment the ischemic core on NCCT with a more reliable inter-rater agreement would improve patient selection and identify AIS-LVO populations in need of further study to improve outcomes.
Supervised deep learning is a promising technique that has been successfully applied in medical image segmentation challenges, such as lesion segmentation on CT perfusion images of the brain 18. Furthermore, benchmark deep learning models for out-of-the-box segmentation of diverse medical imaging datasets have been developed 19 and sparsely applied to ischemic stroke segmentation 20. However, the low signal-to-noise ratio and ill-defined borders of the ischemic core on NCCT result in segmentation variability between experts 21. This variability makes it difficult to define the ground truth and to evaluate deep learning model performance against current segmentation methods (manual segmentation by experts) 22,23.
We present a deep learning framework and evaluation process specifically designed for segmenting ischemic stroke lesions on NCCT scans. This framework allows us not only to compare the model's segmentation with the ground truth segmentation of the test set 20,[24][25][26], but also to evaluate for non-inferiority when compared to two test experts. In this way, we may show that the model segmentations generalize to experts it was not trained with, measuring the degree to which the model is consistent with the ischemic core as a biomarker with inherent variability between experts.
We hypothesized that a deep learning model trained against an experienced neuroradiologist may accurately identify and segment hypodensity that represents the ischemic core on NCCT. We also hypothesized that this trained deep learning model would segment the ischemic core non-inferiorly when compared to other neuroradiologists. We tested these hypotheses in NCCT studies of AIS-LVO patients enrolled in the DEFUSE 3 trial.

Ischemic Core Hypodensity Ground Truth Determination
The median volume of the ground truth on NCCT as determined by Expert A was 12 (IQR: 5-30) ml in the training set and 13 (IQR: 5-35) ml in the test set. Similar ischemic core volumes were determined by CT perfusion in the training set and in the test set (11 (IQR: 0-39) ml and 12 (IQR: 4-32) ml, respectively).

Evaluation of Model
On the test set, the final model trained on expert A achieved the following performance: Surface Dice at Tolerance 5mm of 0.46 ± 0.09, Dice of 0.47 ± 0.13, and absolute volume difference (AVD) of 7.43 ± 4.31 ml (Table 2, last column). We observed similar performance on the validation sets and present further details of weaker-performing models in supplementary Table ??.
To put the results of the final model into perspective, the predicted segmentations on the test set were then compared to the test experts B and C.
With the chosen metrics and lower boundaries, the model-expert agreement (model trained on expert A compared to experts B and C) is non-inferior to the inter-expert agreement (expert A compared to experts B and C) (Figure 4). For expert B, the model-expert agreement is better than the inter-expert agreement (Surface Dice at Tolerance of 5mm 0.63 ± 0.16 vs. 0.54 ± 0.09, Dice 0.56 ± 0.18 vs. 0.47 ± 0.16). For expert C, the model-expert and inter-expert agreements are similar and within the testing boundary (Table 2).
In addition, the volumetric inter-expert and model-expert agreements are visualized with scatter plots in Fig. 1 with Spearman Correlation Coefficient (R). The correlation between the predicted volumes of the model and the test expert is higher than that between experts (R=0.75 vs. R=0.74 for expert B, top row, blue lines; R=0.79 vs. R=0.63 for expert C, bottom row, yellow lines). Analyses on models that are trained on each of the other experts can be found in the Supplementary Tables ??, ??.

Discussion
In this study, a 3D Convolutional Neural Network (CNN) segmented the hypodense ischemic core on NCCT in a manner that was non-inferior compared to expert neuroradiologists. Our results are notable because the segmentation of acute ischemic stroke is a challenging task compared to less complex tasks in which deep learning methods have shown promise 27.
In addition, the non-inferiority of the model across comparisons to multiple different expert neuroradiologists suggests that our results are generally applicable to the identification and quantification of the ischemic core on NCCT, and not limited to the emulation of any particular clinician.
These results have important implications for the care of patients with AIS-LVO.
Segmentation of the ischemic core on NCCT is challenging and suffers from high inter-expert variability. This variability results in significant difficulty in ischemic core segmentation and in the definition of a gold standard. Our results have important implications for artificial intelligence approaches to detect and quantify ischemic brain injury on NCCT.
The detection of cerebral ischemia and the ischemic core differs between commonly used imaging modalities, such as MRI (DWI), NCCT, and CT perfusion. This variability may result in differences in an imaging modality's ability to detect and localize the ischemic core. The sensitivity of NCCT for cerebral ischemia detection may be as low as 30% 28,29. This variability hampers consistent evaluation of deep learning models and integration in clinical practice.
In order to create an optimal ground truth, prior work created a hypodense ischemic core lesion on NCCT from healthy patients by co-registering ischemic core lesions from DWI studies of acute ischemic stroke patients 30. Other studies have also chosen DWI lesions as ground truth and co-registered them to NCCT images from the same patient 26,31. However, very few centers have large databases of patients with NCCT and DWI acquired within short time intervals to facilitate the development of a CNN that uses the ischemic core on DWI as the ground truth. In addition, diffusion restriction is a unique phenomenon of DWI, especially in earlier time windows (< 1 h) where cytotoxic edema is the predominant abnormality that is imaged 32. Hypodensity on NCCT is generally felt to largely reflect vasogenic edema, which normally develops > 1-4 h after stroke onset 33, suggesting irreversibly damaged brain tissue (ischemic core). We chose the ground truth segmentation based on the human reader with the most experience among expert neuroradiologists. Compared to related research, we report advancements in model development (Supplementary Table ??). We show significant non-inferiority through a comprehensive statistical analysis incorporating multiple performance metrics 20.
Cell death in ischemic stroke is time-sensitive and happens on a continuous temporal scale, which results in a very difficult segmentation task even for experienced neuroradiologists in AIS-LVO patients. The CNN developed in this study demonstrated strong performance in the delineation of the ischemic core across multiple expert neuroradiologists, which suggests that this approach is likely to be generalizable in AIS-LVO patients. Future studies should test this hypothesis. In addition, our results have the potential to increase the consistency and quality of stroke assessment on NCCT in the emergency setting across hospitals where expert neuroradiologists might not always be available.
This study has limitations. First, the dataset originates from the DEFUSE 3 trial, which randomized stroke patients presenting within 6-16 hours. However, to diversify the cohort, we also included non-randomized patients who did not meet the inclusion criteria (Table 1). Second, we included the manual segmentations of three experts. Since an absolutely correct ground truth for ischemic core segmentation is not well-defined, more experts might be necessary for more accurate validation of results 21.

Conclusion
A CNN was non-inferior to expert neuroradiologists for the segmentation of the hypodense ischemic core on NCCT.

Study Design and Data
This post-hoc analysis of the DEFUSE 3 trial included 232 AIS-LVO patients with NCCT who were either enrolled in the study or screened but not enrolled 5. This multi-center (38 U.S. centers with obtained IRB approval) trial investigated thrombectomy eligibility for patients with acute ischemic stroke with an onset time within 6-16 hours (https://clinicaltrials.gov/ct2/show/NCT02586415). The patient cohort includes patients that met the inclusion criteria (symptom onset within 6-16h, anterior circulation, NIHSS ≥ 6) and patients that were excluded from randomization because of exclusion criteria (no LVO, within 6h of symptom onset). Further scanning parameters and details of the patient cohort are described in the original publication of the DEFUSE 3 trial 5. All patients or their legally authorized representatives provided informed consent. Institutional review board approval from the Administrative Panel on Human Subjects in Medical Research at Stanford University was obtained for this study. All methods were performed in accordance with the relevant guidelines and regulations.

Ischemic Core Hypodensity Ground Truth Determination
Three experienced neuroradiologists from the USA and Belgium with 4, 4, and 9 years of clinical experience post-fellowship in diagnostic neuroradiology were instructed to outline abnormal hypodensity on the NCCT that was consistent with acute ischemic stroke within 6-16 hours of symptom onset. Segmentation was performed with the drawing tool in Horos (Horosproject.org, version 4.0.0). Experts had the option to not segment any tissue if no abnormal hypodensity was appreciated. Experts were blinded to all imaging other than the NCCT. For detailed instructions see the original instruction sheet (Supplementary Fig. ??).

Data Preparation and Partitions
The NCCT image and corresponding manual segmentation mask were resized to a common resolution of 22-56 x 512 x 512, resampled, and normalized using an existing preprocessing pipeline 19. A mirrored rigid co-registered version of each input image was computed using SimpleITK to provide the model with symmetry information of the opposite hemisphere (https://simpleitk.org/) 30. Data augmentation was performed with the Python package "batchgenerators" (version 2.0.0), including rotation, random cropping, gamma transformation, flipping, scaling, brightness adjustments, and elastic deformation 19.
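The idea of a mirrored second input channel can be illustrated with a deliberately simplified sketch. It only flips each slice about the image midline and stacks the result with the original; the rigid co-registration performed with SimpleITK in this study is omitted, and the sketch assumes the head is already centered and upright. The function names and the toy volume are hypothetical.

```python
def mirror_left_right(slice_2d):
    """Flip a 2D slice (a list of rows) about its vertical midline."""
    return [list(reversed(row)) for row in slice_2d]


def two_channel_input(volume):
    """Stack a volume and its left-right mirror as two channels.

    The result follows the layout channels x depth x height x width.
    Assumption: no registration step; the midline is the image center.
    """
    mirrored = [mirror_left_right(s) for s in volume]
    return [volume, mirrored]


# Hypothetical toy "scan": 1 slice of 2 x 4 pixels with a darker left edge.
volume = [
    [[10, 30, 30, 30],
     [12, 30, 30, 30]],
]
x = two_channel_input(volume)
print(len(x), x[1][0])
```

With both channels available at the same location, a network can compare each voxel to its counterpart in the opposite hemisphere, which is the symmetry information the paragraph above refers to.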
The data were partitioned in three steps. First, the experts were divided into training (expert A, ground truth) and test experts (experts B and C) by the amount of experience to approximate the most accurate ground truth for the model. Second, the cohort of 232 patients was randomly partitioned into 200 training and 32 test patients. Third, the training set was further split into five folds for cross-validation, with 160 patients for training and 40 for validation in each fold. Optimized model configurations were selected based on the result of fold 1 and further validated on folds 2 to 5 (Figure 3, Table ??). The highest-performing model from the 5-fold cross-validation, based on the Surface Dice at Tolerance at 5mm, was then evaluated on the test set.
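The patient-level partitioning described above can be sketched as follows. This is a minimal illustration, not the study's actual code: the seed, the patient identifiers, and the helper name `partition_patients` are assumptions, and the original random split cannot be reproduced from it.

```python
import random


def partition_patients(patient_ids, n_test=32, n_folds=5, seed=0):
    """Split patients into a held-out test set and cross-validation folds."""
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    shuffled = list(patient_ids)
    rng.shuffle(shuffled)

    test_ids = shuffled[:n_test]
    train_ids = shuffled[n_test:]

    # Deal the training patients into n_folds disjoint validation subsets.
    val_subsets = [train_ids[i::n_folds] for i in range(n_folds)]
    folds = []
    for val in val_subsets:
        val_set = set(val)
        train = [p for p in train_ids if p not in val_set]
        folds.append({"train": train, "val": val})
    return test_ids, folds


# Hypothetical identifiers standing in for the 232 NCCT studies.
patient_ids = [f"patient_{i:03d}" for i in range(232)]
test_ids, folds = partition_patients(patient_ids)
print(len(test_ids), [(len(f["train"]), len(f["val"])) for f in folds])
```

Splitting by patient before forming folds keeps the 32 test patients completely separate from any training or validation fold.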

Model Architecture and Training
A nnUNet was trained on the NCCTs with manual segmentations of the training expert A as reference annotations (PyTorch 1.11.0, Python 3.8, CUDA 11.3). The model's input comprised a brain NCCT, along with an NCCT of the same patient. In the latter scan, the ipsilateral hemisphere is replaced by a mirrored version of the contralateral hemisphere. The output of the model was a segmentation mask 19. The final nnUNet configuration includes a patch size of 28x256x256 and spacing of (3.00, 0.45, 0.45), 7 stages with two convolutional layers per stage, leaky ReLU as an activation function, Soft Dice + Focal 34 loss functions with equal weights, alpha of 0.5 and gamma of 2, a batch size of 2, stochastic gradient descent optimizer and He initialization (Supplementary Figure ??). We empirically set the epoch number to 350. In addition, we applied further regularization techniques such as L2 regularization, a dropout of 0.1, and a momentum of 0.85. Data augmentation included rotation, random crop, re-scaling, elastic transformation, and flipping. Please see supplementary Table ?? for a detailed discussion and analysis of technical procedures.
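To make the loss configuration concrete, the following sketch combines a soft Dice loss with a binary focal loss using alpha = 0.5 and gamma = 2, as stated above. It operates on flat lists of foreground probabilities rather than tensors, and it assumes that "equal weights" means a plain sum of the two terms; the smoothing constants and function names are illustrative and are not taken from the nnUNet code base.

```python
import math


def soft_dice_loss(probs, targets, eps=1e-6):
    """1 - soft Dice between predicted probabilities and binary targets."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (denom + eps)


def focal_loss(probs, targets, alpha=0.5, gamma=2.0, eps=1e-7):
    """Mean binary focal loss; easy voxels are down-weighted by (1 - p_t)^gamma."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)          # avoid log(0)
        p_t = p if t == 1 else 1.0 - p           # probability of the true class
        alpha_t = alpha if t == 1 else 1.0 - alpha
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)


def combined_loss(probs, targets, w_dice=1.0, w_focal=1.0):
    """Equal-weight sum of the two terms (the weighting scheme is an assumption)."""
    return w_dice * soft_dice_loss(probs, targets) + w_focal * focal_loss(probs, targets)


# Hypothetical voxel labels and two candidate predictions.
targets = [0, 0, 1, 1, 1, 0]
good = [0.1, 0.2, 0.9, 0.8, 0.7, 0.1]
bad = [0.9, 0.8, 0.1, 0.2, 0.3, 0.9]
print(combined_loss(good, targets), combined_loss(bad, targets))
```

The Dice term rewards overlap regardless of lesion size, while the focal term concentrates the gradient on voxels that are still misclassified, which is a common motivation for pairing the two on small, low-contrast targets.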

Metrics
The models were evaluated using a set of volume, overlap and distance metrics (for definitions see Supplementary Table ??). The best configuration was chosen based on the 'Surface Dice at Tolerance' with a tolerance of 5mm. The Surface Dice at Tolerance, also known as the Normalized Surface Dice, quantifies the separation between individual surface voxels in the reference and predicted masks. The tolerance establishes the maximum acceptable distance for surface voxels in the reference and predicted masks to be classified as true positive voxels. This metric is especially useful if there is more variability in the outer compared to the inner border, as is the case for ischemic stroke segmentation [35][36][37]. In this work, we chose the tolerance to be 5mm based on the average surface distance between experts.
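The Surface Dice at Tolerance can be illustrated on a small 2D example. The sketch below is a brute-force version for intuition only: it treats a foreground pixel as a surface pixel if it has a background or out-of-image 4-neighbor, and it counts surface pixels whose nearest counterpart on the other surface lies within the tolerance. The 2D restriction, the neighborhood definition, and the handling of empty masks are assumptions; the exact implementation used in this study is not specified here.

```python
import math


def surface_points(mask):
    """Foreground pixels that touch background (or the image edge) in a 2D mask."""
    rows, cols = len(mask), len(mask[0])
    points = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c]:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if not (0 <= rr < rows and 0 <= cc < cols) or not mask[rr][cc]:
                    points.append((r, c))
                    break
    return points


def surface_dice(ref, pred, tolerance_mm, spacing_mm=(1.0, 1.0)):
    """Fraction of surface pixels lying within tolerance of the other surface."""
    s_ref, s_pred = surface_points(ref), surface_points(pred)
    if not s_ref and not s_pred:
        return 1.0  # both masks empty: treated as agreement (an assumption)
    if not s_ref or not s_pred:
        return 0.0

    def count_within(points, others):
        count = 0
        for r, c in points:
            nearest = min(
                math.hypot((r - r2) * spacing_mm[0], (c - c2) * spacing_mm[1])
                for r2, c2 in others
            )
            if nearest <= tolerance_mm:
                count += 1
        return count

    matched = count_within(s_ref, s_pred) + count_within(s_pred, s_ref)
    return matched / (len(s_ref) + len(s_pred))


def square_mask(top, left, size=3, shape=(8, 8)):
    """Hypothetical helper: a filled square on an otherwise empty grid."""
    return [[1 if top <= r < top + size and left <= c < left + size else 0
             for c in range(shape[1])] for r in range(shape[0])]


ref = square_mask(2, 1)
pred = square_mask(2, 3)  # same lesion shape, shifted by two pixels
for tol in (0.0, 1.0, 2.0):
    print(tol, surface_dice(ref, pred, tol))
```

Raising the tolerance makes the score progressively more forgiving of boundary shifts, which is why the tolerance was tied to the average surface distance between experts.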

Statistical Analysis
R (Version 2022.02.3) was used for statistical analysis. To evaluate the model performance for generalizability on unseen data, we measured the degree to which the model segmentation on the test set is consistent with the ischemic core as a biomarker with inherent variability between experts. For that, we compared the model segmentation against the test experts B and C (Figure 4). We used each metric to evaluate how close the model segmentations were to the test experts.
We used the one-sided Wilcoxon signed-rank test (α = 0.05, n=32) of the median metric values upon a negative Shapiro test for normality. We chose the following non-inferiority boundaries:
• Relative metrics with values between 0 and 1: The model-expert agreement is no worse than 20% of the metric range compared to the inter-expert agreement.
• AVD: The model-expert absolute volume difference is at most 3 ml larger compared to the inter-expert agreement.
• HD 95: The model-expert maximum distance is at most 3mm larger compared to the inter-expert agreement.
We chose these boundaries based on the average difference in inter-expert agreements as a measure for variability (for metrics with a range of 0 to 1: 0.19, for metrics with SI Units: 2.53). This identifies whether model performance is comparable, within the normal bounds of variation, to experienced neuroradiologists 24. This implies that, to reach non-inferiority, the difference between the model-expert and inter-expert agreement is tested for being smaller than the average variability of agreement among experts.
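One way to picture the non-inferiority test with these boundaries is sketched below. For each test patient, the model-expert metric is compared with the inter-expert metric after adding the margin, and a one-sided Wilcoxon signed-rank test checks whether the shifted differences are positive. This is an illustrative Python sketch under stated assumptions, not the R analysis used in the study: the construction of the shifted differences is an assumption, the p-value uses a normal approximation without tie or continuity correction, and all data are synthetic.

```python
import math


def wilcoxon_greater(diffs):
    """One-sided signed-rank p-value for H1: median(diffs) > 0 (normal approximation)."""
    d = [x for x in diffs if x != 0.0]  # zero differences are dropped
    n = len(d)
    if n == 0:
        return 1.0
    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, x in zip(ranks, d) if x > 0)
    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_plus - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2.0))  # upper tail of N(0, 1)


def non_inferiority_p(model_expert, inter_expert, margin, higher_is_better=True):
    """p-value for 'model-expert is not worse than inter-expert by more than margin'."""
    if higher_is_better:   # e.g. Dice, Surface Dice (margin 0.2)
        diffs = [m - e + margin for m, e in zip(model_expert, inter_expert)]
    else:                  # e.g. AVD in ml or HD 95 in mm (margin 3)
        diffs = [e - m + margin for m, e in zip(model_expert, inter_expert)]
    return wilcoxon_greater(diffs)


# Synthetic per-patient Dice values for 32 hypothetical test patients.
inter_expert = [0.40 + 0.01 * i for i in range(32)]
model_slightly_worse = [x - 0.05 for x in inter_expert]   # inside the 0.2 margin
model_much_worse = [x - 0.30 for x in inter_expert]       # outside the margin
print(non_inferiority_p(model_slightly_worse, inter_expert, margin=0.2))
print(non_inferiority_p(model_much_worse, inter_expert, margin=0.2))
```

A small p-value supports non-inferiority within the chosen margin; a large one means the data do not rule out that the model is worse than the experts by more than the margin.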
All p-values were adjusted for the total number of statistical tests presented in the paper using the Holm-Bonferroni method. The significance threshold is p<0.05.
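The Holm-Bonferroni adjustment is a step-down procedure: the smallest p-value is multiplied by the number of tests, the next smallest by one fewer, and so on, with the adjusted values forced to be non-decreasing and capped at 1. A short sketch with hypothetical p-values (not values from this study):

```python
def holm_bonferroni(p_values):
    """Return Holm-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted


raw = [0.001, 0.04, 0.03, 0.005]  # hypothetical raw p-values
adjusted = holm_bonferroni(raw)
significant = [p < 0.05 for p in adjusted]
print(adjusted, significant)
```

Each adjusted value can be compared directly with the 0.05 threshold, so a test counts as significant only if it survives the correction for all tests performed.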
We report statistical analysis on the test set.

Fig. 2 displays a qualitative comparison of the model's prediction, comparing annotations by experts A, B and C in two patients with different image quality. The model prediction visually agrees with the training expert A (ground truth) as well as with the test experts B and C.

Figure 1. Scatter plots of Volume Agreement between Experts and Model on Test set. Top row: Inter-Expert and Model-Expert Agreement for expert B. Bottom row: Inter-Expert and Model-Expert Agreement for expert C. R = Spearman's Correlation Coefficient, Gray Area = 95% confidence region, Black dots = individual data points. The gray areas are smaller in the model-expert comparisons (rightmost column), indicating a lower variance for the predicted volumes.

Figure 2. Qualitative analyses of experts A, B, and C and the Prediction of the Model. Patient 1 (left): higher quality NCCT. Patient 2 (right): lower quality NCCT. Experts A, B, and C agree on the location and volume of the stroke. The model prediction (last row) agrees as well with the test experts B and C as with the training expert A.

Figure S.1. Model Architecture. The modified nnUNet configuration includes a large patch size of 28x256x256 and 7 stages with two 3D convolutions per stage. The preprocessed input had 2 channels (the CT image and a mirrored CT image in which the ipsilateral hemisphere is replaced by a mirrored version of the contralateral hemisphere), a spacing of (3.00, 0.45, 0.45) and dimensions of 22-56 x 512 x 512. The output of the model was a segmentation mask. The parameter space after each stage is denoted in PyTorch's tensor convention (channels x depth x height x width).

Figure S.2. Segmentation Instruction Document for Experts

Table 1. Characteristics of randomized and non-randomized patients from the DEFUSE 3 dataset

Table S.5. Definitions of Performance Metrics for Medical Image Segmentation