Deep learning-based detection and segmentation of diffusion abnormalities in acute ischemic stroke

Background Accessible tools to efficiently detect and segment diffusion abnormalities in acute strokes are highly anticipated by the clinical and research communities. Methods We developed a tool with deep learning networks trained and tested on a large dataset of 2,348 clinical diffusion weighted MRIs of patients with acute and sub-acute ischemic strokes, and further tested for generalization on 280 MRIs of an external dataset (STIR). Results Our proposed model outperforms generic networks and DeepMedic, particularly in small lesions, with lower false positive rate, balanced precision and sensitivity, and robustness to data perturbs (e.g., artefacts, low resolution, technical heterogeneity). The agreement with human delineation rivals the inter-evaluator agreement; the automated lesion quantification of volume and contrast has virtually total agreement with human quantification. Conclusion Our tool is fast, public, accessible to non-experts, with minimal computational requirements, to detect and segment lesions via a single command line. Therefore, it fulfills the conditions to perform large scale, reliable and reproducible clinical and translational research.


Introduction
Stroke is a major cause of death and long-term disability in US, with increasing mortality rates among middle-aged adults. 1 Defining the stroke core in diffusion weighted images (DWIs), a Magnetic Resonance Imaging (MRI) sequence highly sensitive to the acute lesion, is a major benchmark for acute treatment.
The importance of fast locating and objectively quantifying the acute damage has been reinforced by many stroke trials [2][3][4][5] .Clinical research also relays on objective quantification of lesions to access brain function (as lesion-symptoms studies 6 ), to stratify populations in clinically relevant groups, and to model prognosis [7][8][9][10] .Due to the great variability of stroke population and lesions in both biological and technical aspects, stroke research depends on large datasets and, consequently, on automated tools to process them with accuracy and high levels of reproducibility.Therefore, accessible tools that can fast and efficiently detect and segment diffusion abnormalities in acute strokes are well anticipated by the clinical and research communities.
Traditional machine learning algorithms use texture features 11 , asymmetric features 12 , abnormal voxel intensity on histogram 13 or probabilistic maps of populations, 14,15 , or advanced statistical models such as support vector machine 16 and random forest 17 , to detect and segment diffusion abnormalities in acute strokes.Although these methods represent resourceful statistical approaches and can work reasonably in high resolution images and homogeneous protocols, the low-level features these methods utilize such 2/40 as intensity, spatial, and edge information reduce their capability to capture the large variation in lesion pattern, especially in typical clinical, low-resolution and noisy data.
As the prosperous progress in Deep Learning (DL) in the past two decades, several DL neural network models [18][19][20] , such as the convolutional neural networks (CNNs), performed better brain lesion detection and segmentation than traditional methods.A landmark was the introduction of UNet 21 , to re-utilize coarse semantical features via skip connections from encoders to decoders in order to predict better segmentation.
The 3D networks suffer from issues like gradient vanishing and lack of generalization, due to their complex architectures.Training a large number of parameters requires a large number of training samples, which is not typically available when utilizing real clinical images.As a result, most prior DL methods 20,30,31 were developed in 2D architectures, ignoring the 3D contextual information.In addition, they were mostly trained and/or tested on 2D lesioned brain slices; therefore, they are not generalizable in clinical applications where there is no prior information about the lesion location.More recently, "2.5D" networks 32 , 3D dual-path local patch-wise CNNs, DeepMedic 31 and other CNNs proposed in the Ischemic Stroke Lesion Segmentation (ISLES) challenge, 18 aimed to improve local segmentation.However, they still do not fully utilize the whole-brain 3D contextual information which might lead to more false positives, especially in "lesion-like" artefacts commonly seen in clinical DWIs.Furthermore, previous networks were mostly evaluated on segmenting late sub-acute and chronic ischemic abnormalities in high resolution MRIs, with research-suited homogenous protocols [32][33][34] which do not reflect the challenges of real clinical data.To our best knowledge, Zhang et al. 35 was the first to reveal the potential of the 3D whole-brain dense networks on low-resolution clinical acute lesion segmentation.However, the relatively modest sample size utilized (242 clinical low-resolution images for training, validating and testing the models, plus 36 downsampled high-resolution images to test for generalization), compared to the large amounts of trainable parameters and high variations of stroke biological features, raises questions about the generalization and 3/40 robustness of the derived models.
The requirement of large amounts of data for training and testing DL models is a common challenge for achieving good generalization and efficiency in practical applications.Artificial data augmentation does not necessarily mirror the biological heterogeneity of strokes, and imperfectly reflects the noise, low resolution, and technical variability of clinical data.Models developed in modest and homogeneous samples, or in artificial augmented data, might be less efficient in heterogeneous and noisy clinical samples. 36Lastly, the lack of accessibility reduces the potential generalization and translation of previous methods: currently, there is no platform that allows regular users (not imaging experts) to readily and rapidly detect and segment diffusion abnormalities in acute stroke clinical images.
In this study, we developed DL network ensembles for the detection and segmentation of lesions from acute and early subacute ischemic strokes in brain MRIs.Our network ensembles were trained and validated in 1390 clinical 3D images (DWIs) and tested in 459 images.The lesion core was manually delineated in all the images, serving as "ground truth".Furthermore, we evaluated false positives on extra 499 DWIs of brains with "not visible" lesions.To our best knowledge, this is the largest clinical set ever used.We also tested our trained models on an external population of 140 individuals scanned at the hyperacute stroke phase and 24 hours later (STIR dataset 37 ).Our results show that our model ensembles are in general comparable to 3D path-wise CNNs (as "DeepMedic") for lesion segmentation, performing superiorly for detection and segmentation of small lesions, with much lower false positive rate.Our model is generalizable and robust on the external dataset, over the acute phases.Finally, our model is readily and publicly available as a tool with minimal implementation requirements, enabling non-expert users to detect and segment lesions in about one minute in their local computers, with a single command line.

Cohort
This study included MRIs of patients admitted to the Comprehensive Stroke Center at Johns Hopkins Hospital with the clinical diagnosis of ischemic stroke, between 2009 and 2019 (Flowchart for data inclusion in Figure 1).We have complied with all relevant ethical regulations and the guidelines of the Johns Hopkins Institutional Review Board, that approved this study.From the 2,348 DWIs quality-4/40 controlled for clinical analysis, 499 images did not show visible lesion.These mostly include Transitory Ischemic Strokes (TIA) or strokes with volume under the image resolution.They are called images with "not visible" lesions and were used to calculate false positives in this study.The other 1,849 DWIs showed lesions classified by a neuroradiologist as result of acute or early subacute ischemic stroke, with no evidence of hemorrhage.The lesion core was defined in DWI, in combination with the Apparent Diffusion Coefficient maps (ADC) by 2 experienced evaluators and was revised by a neuroradiologist until reaching a final decision by consensus (details in Methods).This manual definition was saved as a binary mask (stroke=1, brain and background=0) in the original image space of each subject.
The DWIs with ischemic lesions were randomly split into training dataset (n=1,390), which was used to train and validate all subsequent models, and testing dataset (n=459), which was used exclusively for testing models.A second external dataset, STIR, was used to test the generalization of our models in a completely unrelated population (Flowchart in Figure 1).We have complied with all relevant STIR regulations for data usage.From the STIR dataset, we included 140 subjects scanned twice: at the hyperacute stroke event (here called "STIR 1") and up to 24 hours later ("STIR 2").The longitudinal aspect of STIR enables to evaluate the performance of our models according to the stroke's onset.
The demographic, lesion and scanner profiles for all the datasets are summarized in Table 1.The distribution according to arterial territories (MCA > PCA > VEB > ACA) and the demographic characteristics reflect the general population of stroke patients.MRIs were obtained on seven scanners from four different manufacturers, in different magnetic fields (1.5T ( 60%) and 3T), with more than a hundred different protocols.The DWIs had high in plane (axial) resolution (1x1mm, or less), and typical clinical high slice thickness (ranging from 3 to 7 mm).Although a challenge for imaging processing, the technical heterogeneity promotes the potential generalization of the resulting developed tools.The demographics (age, sex, race, time from symptoms onset, and NIHSS) and lesion characteristics (e.g., volume, intensity, location) were balanced between training and testing ischemic sets (Table 5, Supplementary Information).

Lesion segmentation with unsupervised ("classical") methods
An important question is how the lesion segmentation generated by unsupervised classic methods for abnormal voxel detection compares with DL; and whether the combination of supervised and non-

5/40
supervised methods improves the performance.To investigate these questions, we generated ischemic probabilistic maps of abnormal voxels, here called "IS", via the modified c-fuzzy algorithms 15 .The best model (Table 4, Supplementary Information) showed modest efficiency on the Testing dataset (Dice: 0.45 ± 0.26) and in STIR 2 (Dice: 0.48 ± 0.28) and was even less efficient in STIR 1 (Dice: 0.23 ± 0.22).Details of these calculations and their efficiency on lesion segmentation are in Methods.

General comparison of DL models' performance
Our proposed 3D network, called "DAGMNet" (Figure 2), was compared with other generic benchmark networks, including FCN, U-Net, and DeepMedic.All models were compared to the manual lesion delineation.
In general, "CH3" models that utilize 3 channel inputs (DWI+ADC+IS, the probabilistic ischemic maps obtained by the modified c-fuzzy method) performed slightly better than "CH2" models (that have only DWI and ADC as inputs).This is particularly noticeable in STIR 1, in a trade-off a slight increase in false positives.It is likely that the extra "IS" channel conditions the network attention to regions of abnormal voxel intensity.This improves the model performance in most of the cases, as the identified abnormal voxels truly correspond to lesions.On the other hand, it might increase the false positives in a minority of cases, in which the abnormal voxels correspond to DWI artefacts.
As depicted in Figure 4 and Table 2, our proposed model, DAGMNet_CH3, outperforms generic UNet and FCN in all the testing datasets, with higher Dice scores and higher Precision.This is particularly significant by comparing the performances of DAGMNet and FCN in the Testing dataset and STIR2 (Figure 10, Supplementary Information).This suggests the robustness and generalization of our proposed model to external data.Compared to DeepMedic, our proposed model shows higher Dice in the testing dataset, comparable Dice in STIR 2, and lower Dice in STIR 1, although all differences are not statistically significant.DAGMNet_CH3 also shows significant higher precision than DeepMedic in the Testing dataset and STIR2, but lower sensitivity, with a higher number of false negatives in the Testing set.On the other hand, DeepMedic detected significantly more false positives than all other models, both subject-wise and voxel-wise.For instance, DeepMedic claims false-alarm lesions in more than 55% (twice as our model) of the "not visible" lesion cases. 6/40
By grouping lesions according to their location in the major vascular territories (Anterior, Posterior, Middle Cerebral Arteries -ACA, PCA, MCA -and the Vertebro-Basilar -VB -territory), we observed that all models had their worse performance in ACA and their best performance in MCA (Figure 4).As shown in Table 1, the MCA lesions are, in average, larger than lesions in other territories, while the ACA territory contains smaller lesions with lower DWI contrast and is more prone to DWI artfacts.Our models outperformed DeepMedic and generic UNet in ACA and VB territories.Last, the trade-off between precision and sensitivity was clear, for all the models, in MCA lesions.

Performance according to affected hemisphere, population demographics, and scanner profile
In general, all the models performed consistently, regardless the affected hemisphere, patients' sex and race, symptoms onset to MRI time, scanner manufacturers and magnetic strengths, as indicated in Figure 5 and Figure 11 (Supplementary Information).The only exceptions were DeepMedic and FCN_CH2, that had significantly lower Dices in Manufacturer 3 (GE), compared to Manufacturer 1 (Siemens) with p-values 0.0223 and 0.0413, retrospectively.However, this is interpreted with caution given the small sample of Manufacturer 3 (GE) in the Testing dataset.

Relationship of the models' performance with lesion volume and contrast
The models performance, represented by the Dice agreement between the "human-defined" and the "network-defined" (predicted) lesion positively correlated with the lesion volume and lesion contrast in DWI, and negatively correlated with the lesion contrast in ADC (Figure 6).This means that, as expected, large and high-contrast lesions are easy to define, both by human and by artificial intelligence.Our proposed model has the lowest correlation between the Dice and lesion volume.Because the Dice score is sensitive to the false positives, large lesions tend to have higher Dice than small lesions even for methods with uniform performance across lesion volumes.The decrease of the correlation between Dice and volumes found for our proposed method might result from its superior performance in small lesions.The correlation between the human-defined lesion volume versus the predicted lesion volume is high for all the models (Table 2 and Figure 6).Our proposed model showed the highest correlation coefficient between the human-defined lesion contrast versus the predicted lesion contrast, implying the highest agreement with human annotation.

Discussion
Although several multicenter and clinical trials [2][3][4][5] as well as lesion-symptom mapping studies 6 reinforced the need for the quantification of the acute stroke core, there is a gap to be filled by free and accessible technologies that provide fast, accurate, and regional-specific quantification.Commercial platforms 38 are restrictive and do not provide the 3D lesion segmentation required for lesion-based studies or for objective clinical application.Refining methods and disseminating technologies to improve the current lesion estimation will enable better tuning parameters of clinical importance, improving individual classification, and modeling anatomico-functional relations more reliably and reproducibly.We proposed a DL model for acute ischemic lesion detection and segmentation trained and tested on 2,348 clinical DWIs from patients with acute stroke, the largest dataset employed so far.We further tested this model on an independent, external dataset of 280 images (STIR).The utilization of a large, biologically, and technically heterogeneous dataset, overcomes previous limitations of DL models developed in much smaller datasets, as their questionable generalization potential 36 .The use of real clinical data, that usually imposes extra challenges to image processing, such as low resolution, low signal to noise, and lack 8/40 information from multiple contrasts, guarantees the robustness of the developed models for large scale, real scenario applications.
Our proposed model (DAGMNet) showed superior performance than generic models such as UNet and FCN, and DeepMedic, in our independent testing dataset.Our method was significantly superior on segmenting small lesions, a great challenge for previous approaches, while presenting low rate of false positive detection.This might be a result of the use of large amounts of training data, as well as a contribution of the attention gates and the full utilization of brain morphology by our 3D network, which increases the robustness to technical and biological DWI artefacts.This superiority was evidenced by our method's high performance in vascular territories that contains small lesions and are artefact-prone such as ACA and VB (illustrative examples in Figure 7).The balanced precision and sensitivity, and the lower dependence on stroke volume and location compared to other methods, make our model well suited not only for lesion segmentation but also for detection, an important component for clinical applications.In addition, our proposed pipeline showed stable performance regardless the affected hemisphere, population profile and scanner characteristics, other important conditions for large-scale applications.
In the external testing dataset (STIR), our method's performance was again superior or rivaled with that from others.An exception was the lowest Dice achieved in STIR 1, compared to DeepMedic.STIR 1 contains patients with hyperacute strokes (subtle hyperintense or normal intense lesions in DWI).These types of lesions are less frequent in patients with acute and early subacute strokes in our training dataset and particularly challenging for whole-brain-wise networks, compared to local patch-wise networks like DeepMedic.Similarly, our method produced less accurate segmentation in a few cases (n=4) with large, late subacute lesions in STIR 2, that presented subtle intensity contrast in their boundaries (Example J in Figure 7).Note, however, that the advantage on segmenting hyperacute lesions and lesions with subtle contrast was traded-off by the low precision and high false positive rate of DeepMedic in SITR.The incorporation of the patients with hyperacute strokes in our training dataset, a follow-up local patch-wise segmentation network, and/or using DL techniques like transform learning might ameliorate these issues are in our future plans.Last, a common source of failure for all methods were errors in skull stripping that increased false positives outside the brain contour.This only occurred in a small number of STIR images (n=7) that presented a high level of background noise due to an outdated acquisition protocol, therefore, is 9/40 unlikely an issue for future applications.
The automated DL models (our proposed DAGMNet, DeepMedic, and Unet) achieved comparable lesion quantification performance to our experienced evaluators, because their agreement with human delineation was close to the inter-evaluator agreement (Dice: 0.76 ± 0.14, as described in Methods).Among all the tested models, our proposed model showed the highest correlation between the contrast of the predicted lesion and the human-defined lesion, indicating the highest agreement with human evaluation.
Still, we noticed circumstantial disagreements of our method with the human evaluators in 24 cases (5% of the testing set) , in which the retrospective radiological evaluation favored the results of the automated segmentation, particularly regarding to the boundaries definition (Figure 7H).Similarly, by reviewing the "false-positive subjects" (cases in which models' predicted lesions but were initially classified as "not visible" lesions by visual radiological inspection), we identified 29 cases in which the lesion was identified in retrospect, in the light of our method's prediction (example in Figure 7I).This illustrates possible advantages of automated methods in reliability and reproducibility.
In addition to be accurate on lesions detection and segmentation, robust to data perturbs (e.g., DWI artefacts, low resolution sequences, low signal to noise ratio, heterogeneous datasets), and generalizable to external data, our proposed method is fast: it utilizes half of memory required by its closest competitor, DeepMedic, inferring lesions in half of the time (less than a minute, in CPUs) (see Table 2).All these factors make it potentially suitable for real-time applications.Regardless of all these advantages, a method has no massive utility if only the developers or highly expert analysts are able to use it.Therefore, we made our method publicly accessible on https://www.nitrc.org/projects/adsand readily useful to run in CPUs with a single command line, and minimal installation requirements.To the best of our knowledge, this study provides the first DL networks for lesion detection and segmentation, trained, and tested over 2,000 clinical 3D images, available to users with different levels of access to computational resources and expertise, therefore representing a powerful tool for clinical and translational research.

Data Availability
The original data used for training and testing the models derive from retrospective clinical MRIs and, as for today, cannot be publicly released by regulatory restrictions.We will soon share the whole dataset 10/ 40  with the research community under data usage agreements and restricted access.

Lesion delineation
The lesion delineations were performed using ROIEditor.The evaluators looked for hyper intensities in DWI and/or hypo intensities (<30% average brain intensity) in ADC.Additional contrasts were used to rule out chronic lesions or microvascular white matter disease.A "seed growing" tool in ROIEditor was often used to achieve a broad segmentation, followed by manual adjustments.The segmentation was performed by two individuals highly experienced in lesion tracing (more than 10 years of experience).
Additionally, they were trained by detailed instructions and illustrative files, in a subset of 130 cases (random selected 10% of the dataset).These cases were then revised by a neuroradiologist, discussed with the evaluators, and (blinded) retraced, revised, and re-analyzed after 2 weeks.The inter-evaluator index of agreement, Dice score, in this set was 0.76±0.14; the intra-evaluator Dice was 0.79±0.12.Although this is a satisfactory level of agreement, it demonstrates that human segmentation is not totally reproducible, even when performed by experienced, trained evaluators.After consistent consensus agreement was achieved in the initial set, the evaluators started working on the whole dataset.The neuroradiologist revised all the segmentations and identified the sub-optimum cases that were subsequently re-traced.The segmentations were revised as many times as necessary, until reaching final decision by consensus among all evaluators.

Image pre-processing
To convert the images to a common space, where the next preprocessing steps will take place, we: 1) Resampled DWI, B0 (the image in the absence of diffusion gradients), and ADC into 1 × 1 × 1 mm 3 , 2) Skull-stripped with an in-house "UNet BrainMask Network", 3) Used "in-plane" (IP) linear transformations to map the images to the standard MNI (Montreal Neurological Institute) template, 4) Normalized the DWI intensity to reduce the variability and increase the comparability among subjects, 5) "Down-sampled" 11/40 (DS) images to reduce the memory resource requirement in the next steps (e.g., DL networks).Details of these procedures follow.

Automated skull stripping -BrainMask network
To decrease computational complexity and time required for skull stripping, we built an "UNet BrainMask Network" using DWI, B0 images, and gold standard brain masks.The gold standards brain masks were generated as follows: first, all DWI and B0 images were resampled into 1 × 1 × 1 mm 3 and skull striped by a level set algorithm (available under ROIStudio, in MRISturio 39 ), with W 5 = 1.2 and 4, respectively (see explanation about choice of parameters in MRIstudio website 39 ).The brain masks were the union of masking on DWI and B0.Then, these brain masks were further manually corrected by our annotators, serving as ground true for the "UNet BrainMask Network".
To train our "UNet BrainMask Network", all images are mapped to MNI and down-sampled to 4×4×4 mm 3 .The final brain mask inferenced by the network was then post-processed by the closing and the "binary_fill_holes" function from Python scipy module, upsampled to 1 × 1 × 1 mm 3 , and dilated by one voxel with image smoothing.The Dice agreement between the "gold-standard" brain masks and those obtained with our network was above 99.9%, in an independent test dataset of 713 brains.The average processing time was about 19 seconds (against 4.3 min taken by the level-set algorithm 39 ), making it suitable for large scale, fast processing.

IP-MNI and IP-MNI DS space
We used sequential linear transformations (3D rigid translation followed by scaling/rotation along x and y axis) to map B0, less affected by the acute stroke, into JHU_MNI_B0 39 , a template in MNI space.No rotation was performed along the inter-slice coordinate (z-axis), aiming to preserve the image contrast and the continuity of manual annotations, and to ameliorate issues related to the low resolution and voxel anisotropy.We called this standardized space "in-plane MNI"(IP-MNI).The IP-MNI images were then padded into 192 × 224 × 192 voxels, to facilitates the next step of "down-sampling" (DS) by (2, 2, 4) in (x, y, z) axes and max-pooling operations, which reduces the memory resource requirement in DL networks.The voxel dimension of the down-sampled images are 96 × 112 × 48 voxels, which is called IP-MNI DS space. 12/40

DWI intensity normalization
Intensity-normalization increases the comparability between subjects and, as normalizing images to a standardized space, is crucial for diverse image analytical processes.Although the lesion might affect intensity distribution, we assume that the majority of brain voxels are from healthy tissue and can be a good reference for intra-and inter-individual comparison.We used bimodal Gaussian function in Equation (1) to fit the intensity histogram of DWI and cluster two groups of voxels: the "brain tissue" (the highest peak) and "non-brain tissue" (the lowest peak at lowest intensities, composed mostly by cerebrospinal fluid).
, where a i , b i , c i are the coefficients of the scale, mean, and standard deviation of Gaussian distribution.We note that the distributions are much more homogeneous, and the individual variations are smaller after intensity normalization.More importantly, intensity differences between different magnetic fields and scan manufacturers are ameliorated after intensity normalization.Our choice for this intensity normalization approach is, therefore, based on its success to capture the contrast between lesioned voxels and normal tissue voxels, while minimizing variations in intensity range across subjects.The influence of other intensity normalization approaches, from simple z-score normalization up to different self-supervised methods 40,41 , is subject to further evaluation. 13/40

Lesion segmentation with unsupervised ("classical") methods
We tried two "classical" methods for lesion segmentation: 1) "t-scores" maps of DWI and ADC for individual brains 14 , via comparing them to the set of brains with "not visible" lesions, and 2) ischemic probabilistic maps of abnormal voxels, here called "IS", via the modified c-fuzzy algorithms 15 .
σ dwi , σ adc , and σ id are in z-score scale, rather than percentage.W f whm is in the unit of pixel.The best model was W f whm = 4, σ dwi = 2, σ adc = 1, σ id = 3.5.The Dice and Net Overlap (defined as 14 ) of the top four configurations are summarized in Table 3 (Supplementary Information).

Modified c-fuzzy method 15
We first created intensity mean and standard deviation "templates" in IP-MNI space, I s,µ (x, y, z) and I s,σ (x, y, z) for s = dwi and s = adc, separately, using the normalized DWI and ADC of cases with "not visible" lesions.We then modified and computed the dissimilarity equation for individual DWI and ADC, I s (x, y, z) for s = dwi and s = adc, compared to I s,µ (x, y, z) and I s,σ (x, y, z) similar as in Guo et al. 15 .
I s (x, y, z) = tanh( I s (x, y, z) − I s,µ (x, y, z) where α s is the parameter used for controlling sensitivity.The probability map of abnormal low-intensity (H1) and high-intensity (H2) voxels were computed as: To generate the probabilistic maps of ischemic voxels, P IS , we looked for these voxels to have abnormal high intensity(H2) in DWI and abnormal low intensity(H1) in ADC compared to the templates, and also relatively high intensity in its own individual DWI, is defined as follows: , where Q(•) is the standard normal distribution Q function, and t id (x, y, z) is the t-score of DWI voxel intensity, compared to the mean and standard deviation of the whole-brain voxels, µ id and σ id , in each individual DWI.Like what was described for the t-score method, the DWI and ADC images were smoothed by Gaussian filter with different FWHM.The six following parameters' combinations were optimized by cross validation in training dataset: W f whm ∈ {2, 3, 4, 5}, α dwi ∈ {0.5, 1, 1.5, 2, 2.5}, λ dwi ∈ {2, 3, 4}, α adc ∈ {0.5, 1, 1.5, 2, 2.5}, λ adc ∈ {2, 3, 4}, σ id ∈ {2, 2.5, 3, 3.5}.σ id is in z-score scale, rather than percentage.W f whm is in the unit of pixel.The best parameters' configuration was W f whm = 2, α dwi = 1.5, The performance in all datasets is summarized in Table 4.
As shown in Figure 9 (Supplementary Information), the modified c-fuzzy method performed better than the t-score method for lesion segmentation.In addition, the t-score method performed worse on our testing dataset (Dice about 0.37) than what is described by the developers 14 (Dice about 0.5) Therefore, the classical t-score method was considered insufficiently efficient to segment lesions in our large and heterogeneous clinical dataset.The architecture of our proposed 3D DAGMNet, depicted in Figure 2, is equipped with intra skip connections as UNet3+ 24 , fused multi-scale contextual information block, deep supervision, L1-regularization on final predicts, Dual Attention Gate (DAG) [25][26][27] , self-normalized activation (SeLU) 42 , and batch normalization.The details of the important components and training techniques/parameters are outlined in the following subsections.
DAGMNet was intuitively designed to capture semantical features directly from input images at different receptive scale levels, as MNet, and to segment lesions of various volumes with consistent efficiency, via deep supervision at each level.This aims to conquer the drawback that although big lesions can be easily detected at different receptive scale levels in the original generic UNet, small lesions features could be dropped after down-sampling pathway/max-pooling layer.Furthermore, The introduction of the inter-skip connections between layers in UNet structure and intra-skip connections as UNet3+ help model share and re-utilize features between different receptive scale levels with lower computational complexity than DenseUNet and UNet++.The final fuse block combines all-scale semantic features (from small lesion volume to large ones) to generate the final predict output.At the end of the fusion block, L1-regularization on predicts prevents the networks from false-positive claiming.
To overcome the high variability in lesion volume, shape, and location, DAG was utilized to condition the networks to emphasize the most informative components (both spatial and channel wise) from encoders' semantic features at each level prior to decoders, increasing the sensitivity to small lesions or lesions with subtle contrast.In DAG, spatial Attention Gate (sAG) was used to spatially excite the receptive for the most abnormal voxels, like the hyper-intensity in DWI or predicts from the third channel ("IS").On the other hand, the channel Attention Gate (cAG) was included to excite the most semantical/morphological features associated to ischemic lesions from artefacts.
The inclusion of information from classical methods ("IS") as the third channel aimed to help the networks to focus on abnormal voxels, even if in small clusters (small lesions).In addition, the batch normalization technique and SeLU activation function aim to self-normalize the networks, avoiding the gradient vanish problem usually faced in complex 3D networks.with the up-sampled decoded features from the second, third, and the fourth level, respectively.The decoder at the fourth level only takes the features from dual attention gate as input at the same level.
The final predict is fused by a fuse convolution block at the end.The fuse convolution block takes concatenated features decoded from decoders at each level as inputs.The intra skip connection at each level to fuse convolution block was implemented by transpose up-sampling layers.

Dual attention gate (DAG)
DAG has two parts: channel Attention Gate(cAG) and spatial Attention Gate(sAG).Both have two parts, squeezing and exciting parts as depicted in Figure 2A.
• cAG block could be considered as a self-attention function intrinsically exciting the most important feature channels.cAG squeeze all spatial features into a channel-wise descriptor by a global average pooling layer and a global max pooling layer.The most activated local spatial receptive fields would lead to the highest global average/max for its corresponding channel.Then the channel-wise dependencies are calibrated by a fully connected dense layer with sigmoid activation to generate a set of weights of the aggregated spatial information.Each channel will be excited by its corresponding weight post-to the cAG block.
• sAG block could be considered as a self-attention function intrinsically exciting the most important Pooling (GM pooling), Global Average Pooling (GA pooling), and 1 × 1 × 1 convolution layer.
Any local spatial semantical features will be excited by a 5 × 5 × 5 convolution layer with SeLU activation to condition network's attention locally.The sum of both excited spatial local features from both inputs is further calibrated by a 1 × 1 × 1 convolution layer with sigmoid activation.
sAG and cAG both take two inputs: "input 1" from low-level encoded features directly from images at the same receptive scale, "input 2" is all accumulated encoded features from all previous levels, including the current scale level.

SeLU: self normalization activation function
Compared to Rectified Linear Unit (ReLU), SeLU activation function was reported to have more power to reduce gradient vanish problem among more complicated network structures.Besides, its self-normalization makes the network learn more from morphological features and be less depending on image contrast.Hence, we used SeLU activation function in our proposed networks.

Deep supervision
Deep supervision is adopted in our model to learn hierarchical representations at each level.Each decoder and the fuse block are fed into a 3 by 3 by 3 convolution layer with sigmoid activation to generate side output, which is supervised by our ground true lesion annotation at its corresponding level.

Loss function
To ameliorate the imbalanced voxel classes issue (between the number of lesion and non-lesion voxels) and regularize the false positive rate predicted by networks, an hybrid loss function described in Equation ( 6) was utilized to train our proposed models.

18/40
, where L f use is the loss function supervised at the final output of the fusion block X f use and L i,side is the loss function supervised at the side output of the decoders at each level X de,i as follows: L f use = w gds L gds + w bbc L bbc + w r L 1 (7)   L i,side = w gds L gds + w bbc L bbc (8)   , where w gds = 1, w bbc = 1, w r = 1e − 5 due to hyperparameter tuning, L gds is the generalized dice loss function defined as 43 , L bbc is the balanced binary cross-entropy defined as 21 , and L 1 (p) is the L 1 regularization on all predicted voxels defined as , where p (x,y,z) is the predicts from networks at (x, y, z) coordinates.The loss function is optimized during training step by ADAM optimizer with learning rate = 3E − 4. Learning rate will be factor by 0.5 when loss function is on plateau over 5 epochs with minimum learning rate = 1E − 5.

The dimension of the inputs and predicts during training and inferencing
The networks were trained and inferenced in IP-MNI DS space, which is 96 × 112 × 48 voxels.All images (DWI, ADC, IS) in IP-MNI space were down-sampled (DS) along x,y, and z-axis with stride of (2,2,4) into 16 smaller volumes in 96 × 112 × 48 × 3 voxels for 3-channel models, as shown in Figure 2B.
During training step, as shown in Figure 2B, one of these 16 down-sampled volumes are randomly selected to be the inputs of 3-channel networks for each subject in a selected batch (batch size = 4).This re-sampling strategy aims to increase the robustness of our networks to the image's spatial shifting and inhomogeneity.To make efficient back propagation for training networks, the down-sampled volumes were standard normalized to zero mean and unit-variant within the brain mask region for both DWI and ADC channels.
During inferencing step, the 16 lesion predicts from networks in IP-MNI DS space were stacked according to the way their inputs volumes were down-sampled in the original coordinates, to construct the final predict in IP-MNI space (Figure 2C).Then, the lesion mask was "closed" with connectivity 1 voxel.The predicted lesion binary mask (predicted value > 0.5) in IP-MNI space was then mapped back to individual original space.We refined the final prediction by removing clusters with < 5 pixels in each slice, which is the smallest size of lesions defined by human evaluators.

System implementation
All the evaluated methods and models were built with TensorFlow (tensorflow-gpu version is 2.0.0) and Keras (2.3.1)framework on Python 3.6.The experiments run on a machine with an Intel Core (Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz) with 2 NVIDIA TITAN XP GPUs (with CUDA 10.1).The average inference time over all testing dataset and memory requirements are listed in Table 2. 20/40

Evaluation of efficiency and comparison among methods
The performances metrics, Dice score, Precision, Sensitivity, and Subject Detection Rate (SDR), are defined as follows.
Dice = 2TP 2TP + FN + FP (10)   Precision = TP TP + FP (11)   Sensitivity = TP TP + FN (12)   SDR = the number of the subjects detected with lesions the number of the total subjects in dataset TP, FP, and FN, respectively, denote the number of True Positives, False Positives, and False Negatives in the unit of voxel.The SDR represents the percentage of the subjects whose lesions are detected.Therefore, high SDR indicates good performance in the datasets of subjects with visible lesions (our testing dataset and STIR) but indicates low performance in the dataset of subjects with "not visible" lesions, because the predicted lesion voxels would be false positives.Hence, in the dataset of subjects with "not visible" lesions, it is false positive SDR.
All models were compared to the manual delineation of the stroke (used here as "gold-standard") on original raw space.All models were trained, validated, and optimized in the same training dataset, and tested in the independent testing dataset and in the external dataset, STIR.For fair comparison, all models were designed to utilize similar number of trainable parameters, except for DeepMedic, whose default network uses more trainable parameters than all other networks (Table 2).
Besides comparing global performance in the whole brain, we compared models' performance according to lesion volume and location (by vascular territories) and analyzed the Spearman correlation coefficient of performance with the lesion volume and contrast in DWI and ADC.The "scipy.stats"module is used to calculate the Spearman correlation coefficient.The 95%CI for SDR is calculated with z-score = 1.96 and the 95%CI for the Spearman correlation coefficient is calculated by Fishers' transformation with z-score = 1.96.
The lesion contrast in DWI, κ DW I , and in ADC, κ ADC , are defined as the ratio between the average

21/40
intensity of the lesion voxels and the average intensity of the "normal brain tissue" voxels in DWI and ADC, respectively.High κ DW I and low κ ADC are expected in acute and subacute lesions which tend to be "bright" in DWI and "dark" in ADC.Furthermore, false positive analysis was conducted on voxel-and subject-basis among the 499 images with "not visible" lesions.

Statistical significance testing
The statistical significance testing was performed by Anova test in module "bioinfokit" for continuous data, and by Chi-square test in "scipy.stats.chi2_contingency"module for categorical data.The statistical significance on the demographic, lesion and image profiles distribution between training and testing datasets is shown in Table 5 (Supplementary Information).For race/ethnicity and MRI manufacturer, we compared the major groups.For symptoms onset to MRI time, we compared groups with onset time < and ≥ 6 hours.The statistical comparison among models' performance in the Testing and STIR dataset is shown in Figure 10 (Supplementary Information).The statistical comparison of models' performance regarding to demographic, lesion and image profiles is illustrated in Figure 11 (Supplementary Information).I) Case in which the retrospective analysis favored the automated prediction, rather than the human evaluation for: H) lesion delineation, and I) lesion prediction (this case was initially categorized by evaluators as "not visible" lesion, but the small lesion predicted by our model was confirmed by follow-up).J) Lesion of high-intense core but subtle boundary contrast, which ameliorates the discriminative power of all 3D networks.In this case, the patch-wise DeepMedic had the best agreement with manual annotation in the boundaries at the cost of detecting false-positives (arrow).

33/40
a i , b i , c i are calculated by least-square fitting the bimodal Gaussian function to the intensity histogram of individual DWI.DWI intensities are normalized to make the "brain tissue" intensity with zero mean and one standard deviation.Figures 8A (Supplementary Information) shows that the DWI intensity distribution of voxels in a brain with ischemic lesions (blue) and in a brain with "not visible" lesion (orange), prior-to (left column) and post-to (right column) intensity normalization.We note that the preservation of the minor peak at high intensities in the brain with ischemic lesion indicates the preservation of the lesion contrast after normalization.The Figure 8B-D (Supplementary Information) show the distribution of DWI intensities in groups of images, prior-to (left column) and post-to (right column) intensity normalization.

4. 7 . 8 Hyperparameter
For searching hyperparameters, such as weights for loss functions and networks structures, 20% of subjects from the training dataset were randomly selected as validation dataset, with the same random states for all experiment models.In each experiment, once the loss functions converged in validation dataset along the training epochs (200 epochs at top, early stops at 100 epochs if training and validation loss function converged early), we selected the best model from the snapshot models at every 10 epochs in the validation dataset.We chose maximum training epoch as 200 because most experiments of benchmark models converged after 80 to 120 epochs.For each experiment, We trained the same-type networks independently, with different training set and different resampled validation set, at least twice to check if similar performance would be achieved and avoid overfitting.Once the networks parameters (including weights for loss or regularization, different network layers, depth. . .etc) were finalized according to their best dice performance in the validation sets, we used the whole training dataset, including the validation dataset, to train the final deployed models and to make the loss function and dice scores converge to the similar level as the previous experiment.This allowed us to fully use all training dataset and capture the population variation.All Testing dataset and the STIR datasets were totally unexploited till this step was done.

Figure 9 .
Figure 9. Title: The performance of classic models.Legend: The boxplots of Dice and Net overlap for the classic t-score method and modified c-fuzzy method in Training, Testing, STIR 1 and STIR 2 datasets.In each Whisker's boxplot, the white square indicates the average, the black horizontal line indicates the median.The whisker is a representation of a multiple (1.5) of interquartile range (IQR).

16/40 4.7.2 Encoder and decoder
As shown in Figure2, each encoder or decoder convolution block is composed with two consecutive convolution layers with batch normalization and SeLU activation.The number of features at the first level is denoted as "N f ".For our proposed model, N f was 32.At the encoder part, input images were down-sampled by (2, 2, 2) to each receptive scale level and encoded by an encoder convolution block with N f = 32 for each level.At the second, third and fourth level, the encoded features from the previous level were concatenated with the features at the present level and furthered encoded by an encoder block with feature number 2 × N f , 4 × N f , and 8 × N f , respectively.At the decoder part, each level has an decoder convolution block with deep supervision.The feature numbers of decoders at the first, second, third and fourth level are N f , 2 × N f , 4 × N f , and 8 × N f .The input for decoders at the first, second, and third level are features from dual attention gate concatenated

17/40 receptive
fields.sAG squeezes channel-wise features into spatial statistics with Global Maximum

Table 3 .
Configuration of parametersDice (mean±std) Net overlap (mean±std) W f whm σ dwi σ adc σ id Dice and Net Overlap of the top four parameters' configurations of the t-score method.Dataset Dice (mean ± std) Net overlap (mean ± std)

Table 4 .
Dice and Net Overlap of the top parameters' configuration of the modified c-fussy method over all datasets.

Table 5 .
Statistical testing on the demographic, lesion and image profiles distributions between training and testing dataset.Significant differences are marked with "*". 40/40