DEEPMIR: a deep neural network for differential detection of cerebral microbleeds and iron deposits in MRI

Lobar cerebral microbleeds (CMBs) and localized non-hemorrhage iron deposits in the basal ganglia have been associated with brain aging, vascular disease and neurodegenerative disorders. In particular, CMBs are small lesions and require multiple neuroimaging modalities for accurate detection. Quantitative susceptibility mapping (QSM) derived from in vivo magnetic resonance imaging (MRI) is necessary to differentiate between iron content and mineralization. We set out to develop a deep learning-based segmentation method suitable for segmenting both CMBs and iron deposits. We included a convenience sample of 24 participants from the MESA cohort and used T2-weighted images, susceptibility weighted imaging (SWI), and QSM to segment the two types of lesions. We developed a protocol for simultaneous manual annotation of CMBs and non-hemorrhage iron deposits in the basal ganglia. This manual annotation was then used to train a deep convolutional neural network (CNN). Specifically, we adapted the U-Net model with a higher number of resolution layers to be able to detect small lesions such as CMBs from standard resolution MRI. We tested different combinations of the three modalities to determine the most informative data sources for the detection tasks. In the detection of CMBs using single class and multiclass models, we achieved an average sensitivity and precision of 0.84–0.88 and 0.40–0.59, respectively. The same framework detected non-hemorrhage iron deposits with an average sensitivity and precision of about 0.75–0.81 and 0.62–0.75, respectively. Our results showed that deep learning can automate the detection of small vessel disease lesions, and that including multimodal MR data (particularly QSM) can improve the detection of CMBs and non-hemorrhage iron deposits with sensitivity and precision that are compatible with use in large-scale research studies.


Introduction
The aging brain is subject to various irreversible changes, some driven by the aging process itself and others that are associated with various pathologies, including vascular lesions and neurodegeneration [1][2][3][4] . On magnetic resonance imaging (MRI), particularly tuned to be sensitive for differences in magnetic susceptibility, focal accumulations of iron content can be visible. This includes lesions with iron content such as cerebral microbleeds (CMBs) and non-hemorrhage iron deposits in the basal ganglia. CMBs are small hemorrhages that can occur sporadically throughout the brain 5 . CMBs have been associated with cognitive decline and dementia 6 , and are considered a biomarker for small vessel diseases. The presence of lobar CMBs is also a marker for cerebral amyloid angiopathy [7][8][9] . Non-hemorrhage iron deposits are located in the deep structures of the brain, particularly in the basal ganglia. While an increase in iron concentration in the basal ganglia is expected in healthy aging 10 , focal accumulation of iron has been associated with neurodegenerative disorders in small scale studies [11][12][13] .
Most of our knowledge of iron toxicity in the aging brain is limited by the fact that CMBs and iron deposits can be difficult to distinguish from each other and from similar lesions, including calcification, using conventional MRI techniques 14 . T2* gradient-recalled echo (GRE) and susceptibility-weighted imaging (SWI) are often used to clinically characterize CMBs, with the latter being more sensitive for detecting CMBs 15,16 . CMBs can occur anywhere and appear as small rounded or ellipsoidal hypo-intense regions with a diameter of ten millimeters or less 7,14,17 . Non-hemorrhage iron deposits in the basal ganglia have irregular shapes and can be larger than CMBs 14 . Because hypo-intensities in SWI are not specific to CMBs and non-hemorrhage iron deposits, images with other tissue contrasts are required in order to identify other lesion types that can have a similar low susceptibility signal on SWI, such as calcification 5,18,19 . The specificity for CMB detection can be increased by post-processing SWI magnitude and phase data to derive quantitative susceptibility maps (QSM) 20,21 . In QSM, paramagnetic tissue appears different from diamagnetic materials, and this contrast is therefore particularly useful for distinguishing non-hemorrhage iron deposits from calcifications 22,23 . While previous efforts have been made to automate the detection of microbleeds, all previous work neglected the detection of non-hemorrhage iron deposits in such automated frameworks [24][25][26][27][28][29][30] .
No work has been published to date on segmenting iron deposits in the brain using QSM with either a semi- or a fully automatic method. The advances made in MRI technology with QSM for iron content recognition are gaining more attention as cohort-based studies such as The Multi-Ethnic Study of Atherosclerosis (MESA) 31-33 include QSM in their imaging protocol, and thus exploit its advantages in delivering specific insights on iron toxicity in the aging brain. The focus in MESA is on using non-invasive methods to investigate common risk factors, preclinical disease states and manifest diseases with a standardized imaging protocol that is applied to all participants 34 . On the one hand, this provides a unique opportunity to study widely ignored lesions such as iron deposits in vivo using MRI. On the other hand, it comes with additional challenges, because such cohorts naturally include largely cognitively normal participants with a low lesion load, which makes the task very challenging to automate.
In order to tackle the challenges inherent in the detection of these lesions, we developed a robust and fully automated deep learning-based method to detect CMBs and non-hemorrhage iron deposits in a cohort without extensive apparent brain tissue damage and having a low load of CMBs and non-hemorrhage basal ganglia iron deposits. We experimented with both single class and multiclass segmentation models using multiple MR sequences. Our experiments show that using multi-sequence MRI (especially QSM) improves the overall accuracy of detection. The main contributions of this study include the following:
1. We tackled the challenging problem of simultaneously detecting CMBs and non-hemorrhage iron deposits. To our knowledge, this is one of the first reports to detect both types of lesions simultaneously. Iron accumulation in the brain has often been understudied due to the lack of appropriate techniques for detecting it in vivo in large-scale epidemiological studies.
2. We identified the most suitable pulse sequence combination for automating the detection tasks by exploiting imaging information jointly.
3. We developed an effective and flexible neural network model that is specially tailored to the differential detection task. The proposed model can be easily adapted to segment additional lesions.
4. We achieved highly competitive detection performance on real-life data, demonstrating the effectiveness of the proposed approach in practical applications.
5. We provide access to our source code and a few trained models via the GitHub link https://github.com/NAL-UTHSCSA/CMB_NHID_Segmentation

Results
We performed leave-one-out cross-validated evaluations for both single class and multiclass segmentation experiments using the 24 participants listed in Supplementary Table 2.1. Panel A in Figure 1 shows an example of the automated segmentation of a CMB (indicated by the red arrow). Panel B in Figure 1 shows the segmentation of the focal iron deposits in the basal ganglia. In this figure, the model correctly segmented the iron deposit lesions (indicated by the green arrow) while rejecting an instance of calcification (indicated by the yellow arrow). The results of these experiments are reported in Table 1 and Table 2. Pearson's correlation and the Bland-Altman mean difference and confidence intervals for the single class and multiclass experiments with 24 participants are reported in Table 1 and Table 2, respectively. Overall, our experiments show that incorporating QSM in model training can increase the overall accuracy of CMB and iron deposit detection.
In the case of segmenting CMBs, the best performance in terms of average magnitude accuracy is seen with the model trained with SWI and QSM in both single class and multiclass experiments. The correlation coefficient between the prediction and ground truth was also highest (r=0.97 and r=0.99, for single class and multiclass results, respectively) when QSM was included in the training. For non-hemorrhage iron deposits, the single class model trained with all three modalities had the highest average magnitude accuracy and the multiclass model trained with SWI and QSM had the highest average magnitude accuracy. The correlation coefficient was also highest for models that included QSM for training (r=0.92 and r=0.91 for single class and multiclass results, respectively). Figure 2 shows a joint scatterplot of the single class experimental results and Figure  3 shows a joint scatterplot of the multiclass experiments.
In our dataset, we identified a single individual with exceptionally many CMBs as an outlier. A comparative analysis was done by removing this outlier from the dataset and repeating a similar cross-validated evaluation by retraining both single class and multiclass models. The results are detailed in the supplementary materials. An additional leave-one-out cross-validated evaluation was done for the 24 participants using an implementation of the original U-Net 35 . The results of this experiment are also reported in the supplementary materials. We investigated the performance of the proposed DEEPMIR architecture for the simultaneous differentiation and labeling of both CMB and iron deposit labels against the performance of the original U-Net and a modified DEEPMIR architecture (having the same number of resolution layers as the original U-Net). We note that the proposed DEEPMIR model with 6 resolution layers has better overall sensitivity for detecting small lesions such as CMBs. Supplementary Figure 12 and Supplementary Figure 13 show examples of small lesions that the original U-Net and the modified DEEPMIR models were unable to detect, compared to the accurate detection by the proposed DEEPMIR architecture.

Discussion
We developed a deep learning framework for simultaneous segmentation of cerebral microbleeds and non-hemorrhage iron deposits using multi-modal MRI. To date, previously published methods for automated or semi-automated CMB detection have ignored iron deposits. In this study, we consider the iron deposits in the basal ganglia, seen as hypo-intense lesions on SWI and confirmed by QSM, to be iron-specific rather than mineralization. Those lesions may typically be labeled as possible or uncertain microbleeds on MARS 18 and BOMBS 19 , mainly because T2* and SWI cannot differentiate iron content from mineralization. We overcome this limitation by including QSM in our study, which was shown to improve the overall accuracy of automated detection. To our knowledge, no studies have attempted to segment these focal iron deposits automatically using SWI and/or QSM. The deep learning-based segmentation method presented here fills this gap. We have undertaken several experiments using both single class and multiclass models with different combinations of the available MR pulse sequences. We noted that the models which included QSM in training consistently performed better and that the resulting predictions were highly correlated with the reference annotation.
Our approach has several advantages over the current state-of-the-art methods for CMB detection. First, by using deep learning, our model is capable of learning and generalizing features rather than relying on feature vectors derived with conventional image processing algorithms 28-30 , Fourier shape descriptors 36 or probabilistic models 27 . Second, we employ end-to-end learning by using a single model (or network). Previously published methods that used deep learning employed multiple stages consisting of (a) a candidate generation stage which uses either conventional image processing methods 24,26 or an initial (and separate) deep learning-based model 25 for identifying possible CMBs, and (b) a false positive reduction stage in the form of a CNN-based network 24-26 . Our single-stage design allows for greater flexibility, for example in retraining with different or larger data sets, adding additional class labels, or using different modalities, while achieving sensitivity and precision comparable to published results. Third, we trained with different sets of input imaging modalities. Combinations of imaging modalities allowed our models to reject mimics such as calcifications without explicit provisions (as shown in Figure 1, Panel B). Supplementary Figure 9 in the supplementary materials (Supplementary Section 5) shows an example of mineralization being segmented as iron deposits when the model was trained with only SWI. The models in publications 25,26 used SWI only and therefore may not be capable of recognizing and rejecting mimics. The method in publication 24 utilizes SWI-phase and magnitude images along with QSM, but did not consider iron deposits in the basal ganglia. Fourth, we experimented with a reduced number of layers (5 instead of 6 spatial resolution layers) and noted that having more layers can improve the overall results for detecting small lesions such as CMBs.
Our framework has achieved an excellent sensitivity of 89%. However, other studies 24-26 have reported higher precision in detecting CMBs in their samples. We would like to note that it is impossible to directly compare reported numbers from various machine learning models, due to differences in the populations included, study settings, and imaging and scanner characteristics 37 .
Of particular importance is the fact that our sample was drawn from a relatively healthy population without significant brain trauma, injuries, or pre-existing neuro-pathologies whereas the studies in publications 24-26 had hundreds if not thousands of CMB lesions related to or caused by radiation therapy, stroke and traumatic brain injury.
One of the major challenges was the small size of the lesions and their potential presence throughout the brain. The average size of four voxels (or 6 mm³) per CMB together with the generally low lesion burden of the study participants resulted in including only two CMB lesions/4 voxels on average per participant, resulting in a higher weight of a single lesion or error in the evaluation. In other words, missing a single lesion would result in a drop of sensitivity from one to 0.5 and a single false positive for a given participant would result in a drop of that participant's precision from one to 0.5 or 0.66. Similarly, a small number of false positives, in absolute terms, can lower the average precision substantially. In general, our models over-segmented the data in terms of detecting more CMBs than were actually present (Supplementary Figures 10 and 11 in the supplementary materials show examples of false positive CMBs). In all experiments using the aforementioned combinations of available imaging modalities, most of the lesions were detected and the average sensitivity was consistently above 0.75.
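The arithmetic behind these per-participant drops can be checked with a short sketch. It assumes a participant with two reference CMBs (the average reported above); the helper names are illustrative and are not taken from the authors' code.

```python
def sensitivity(tp, fn):
    """Fraction of reference lesions that were detected."""
    return tp / (tp + fn)


def precision(tp, fp):
    """Fraction of predicted lesions that match a reference lesion."""
    return tp / (tp + fp)


# Two reference CMBs, one of them missed: sensitivity drops from 1.0 to 0.5.
missed_one = sensitivity(tp=1, fn=1)

# One false positive next to one true positive: precision drops to 0.5.
one_tp_one_fp = precision(tp=1, fp=1)

# One false positive next to two true positives: precision drops to about 0.67.
two_tp_one_fp = precision(tp=2, fp=1)
```

With so few lesions per participant, each individual error moves the per-participant metric by a third to a half, which is why the averaged precision is sensitive to a handful of false positives.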
Notably, the sample used to train the model was a convenience sample from participants of the MESA study without particular clinical profile and without apparent brain disorders such as dementia, depression, or traumatic brain injury. Given the low number of lesions on average, our method achieved sensitivities that are comparable to state-of-the-art CMB segmentation/detection methods trained with large datasets. We expect that including more samples with more lesions would improve the precision. In general, most studies incorporating automated methods for large-scale abnormality detection or brain region segmentation incorporate a segmentation quality control step that could result in corrections or exclusions 1,38,39 . Thanks to the flexibility of our method, it is straightforward to increase the sample size.
In clinical terms, a larger number of CMBs is more likely to be clinically relevant. The proposed DEEPMIR method was trained and evaluated on a relatively small population and outputs the number of lesions and lesion segmentation maps for each participant. The next step would be to rigorously test and evaluate the proposed model on a larger sample size to ensure viable sensitivity, precision and overall accuracy, before applying it to a large cohort to determine the prevalence of lesions in the population. An adequately trained model can be used as a screening tool to flag participants with a high lesion load. DEEPMIR can also be used to generate an initial segmentation of lesions to accelerate manual annotation.
QSM is a good, non-invasive technique to distinguish between iron content and mineralization in the brain and showed a great advantage in improving the overall accuracy of CMB and iron deposit detection in the current study. While QSM is being recognized and integrated in more population-based studies, large studies with QSM data acquisition such as MESA are still ongoing. This left us with a relatively small amount of imaging data for training. For our experiment, we used a ratio of validation to training data of 25:75, which proved reasonable to ensure that a maximal amount of the available data is used in model training, while at the same time a sufficient amount is reserved for within-training validation. The use of similar sample sizes for training and evaluation is not unprecedented in such small lesion detection 27,29,40,41 . One limitation of using such a small sample size is a reduction in statistical power. For our experiments, we noted that none of the multiple comparisons were statistically significant, and this is likely due to the small sample size. Finally, the limited access to QSM from other studies left us to perform cross-validation 37 with samples from only the MESA AFib cohort for evaluating our model. We were therefore not able to test the generalizability of our model with images generated in other studies with different parameters and characteristics. This line of work should be considered in future research efforts, ultimately building machine learning models and benefiting from pooling imaging data from multiple cohort-based studies 42 .
We have presented a framework for the automated detection of cerebral microbleeds and non-hemorrhage iron deposits in the basal ganglia. While SWI remains the modality of choice for CMB detection, few studies have leveraged QSM as an additional source of information to improve overall detection accuracy, and to date, there have been no attempts to include iron deposits in the basal ganglia as an item of interest. We have utilized QSM in this study to confirm that these focal lesions in the basal ganglia are in fact iron depositions, rather than mineralization such as calcifications. Our deep learning neural network model is flexible and at the same time scalable to include additional modalities and/or class labels while maintaining comparably high sensitivity and precision. In future work, we aim to automatically detect other small vessel disease lesions, such as enlarged perivascular spaces, within our framework. We also aim to investigate possible advantages of expanding our network to a three-dimensional variant.

MRI Acquisition and Pre-Processing
The MESA Exam 6 Atrial Fibrillation (AFib) Ancillary Study's 34 brain MRI protocol included T1-weighted (T1w) and T2-weighted (T2w) sequences, and a susceptibility weighted imaging (SWI) sequence with 4 different, equally spaced echo times. SWI is a high-resolution, 3D imaging sequence where the image contrast is enhanced by combining magnitude and phase image data 43,44 . The scans were acquired at 6 sites using the same acquisition parameters. All scans were performed on Siemens MR scanners (2 Skyra with a 20-channel head coil and 4 Prisma Fit with a 32-channel head coil) at a static magnetic field strength of 3 Tesla and identical imaging sequence parameters, as shown in Supplementary  43,45 . A homodyne high-pass filter with a k-space window size of 64 x 64 was applied to the raw phase image to generate the negative phase mask (with values between 0 and 1). The phase mask was then raised to the power of 4 and multiplied with the magnitude image to generate the SWI. For creation of the reference annotation and subsequent deep learning-based inference, only the SWI image with the shortest echo time (TE=7.5 ms) was used because longer echo times have more noise due to increasingly pronounced blooming effects near the sinus cavity and cerebellum. In addition, SWI images with longer echo times are also more prone to showing false positive CMBs, especially when veins are perpendicular to the imaging plane. Section 2 in the supplementary materials discusses this issue in more detail.
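The last step of the SWI generation can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the phase image has already been homodyne high-pass filtered and is expressed in radians within [-π, π], and it uses the conventional negative phase mask, which scales negative phase linearly to [0, 1] and leaves non-negative phase at 1. The text above only states that the mask takes values between 0 and 1, so the exact mask definition is an assumption.

```python
import numpy as np


def negative_phase_mask(filtered_phase):
    """Map negative phase values to [0, 1); non-negative phase stays at 1.

    Assumes `filtered_phase` is high-pass filtered phase in radians.
    """
    mask = np.ones_like(filtered_phase, dtype=float)
    negative = filtered_phase < 0
    mask[negative] = (filtered_phase[negative] + np.pi) / np.pi
    return mask


def make_swi(magnitude, filtered_phase, power=4):
    """Multiply the magnitude image by the phase mask raised to `power`."""
    return magnitude * negative_phase_mask(filtered_phase) ** power


# Tiny example: four voxels with the same magnitude but different phase.
example_phase = np.array([-np.pi, -np.pi / 2, 0.0, 1.0])
example_magnitude = np.full(4, 100.0)
example_swi = make_swi(example_magnitude, example_phase)
```

Raising the mask to the fourth power suppresses voxels with strongly negative phase much more than voxels with phase near zero, which is what enhances the contrast of susceptibility sources relative to the plain magnitude image.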
The T1w and T2w images underwent N4 bias correction 46 with default parameters using the implementation in the Advanced Normalization Tools (ANTs) (http://stnava.github.io/ANTs) suite and were rigidly registered to the participants' SWI image using FSL's FLIRT [47][48][49] (https://fsl.fmrib.ox.ac.uk). Anatomical parcellation and brain masks were generated with a multiatlas segmentation method using the bias-corrected T1w images 50 . These brain masks were used in the generation of the QSM images. QSM maps were generated using the entire multi-echo SWI dataset using the Morphology Enabled Dipole Inversion (MEDI) method 21,51 implemented in MATLAB (http://weill.cornell.edu/mri/pages/qsm.html). Briefly, background field removal is done using the Projection onto Dipole Fields (PDF) method 52 , followed by region-growing based spatial unwrapping with non-linear fitting 53 to reduce errors, and finally the susceptibility map is calculated using the Morphology enabled dipole inversion with zero reference using CSF (MEDI+0) method 54 .
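The bias correction and rigid registration steps can be expressed as two command-line calls. The sketch below only assembles the argument lists. The file names are hypothetical, and the use of 6 degrees of freedom for FLIRT is an assumption that follows from the registration being described as rigid; the authors' actual invocation is not given in the text.

```python
def preprocessing_commands(structural_image, swi_image, out_prefix):
    """Build ANTs N4 and FSL FLIRT command lines for one structural image.

    All paths are hypothetical placeholders. Nothing is executed here.
    """
    n4_output = f"{out_prefix}_n4.nii.gz"
    registered_output = f"{out_prefix}_n4_in_swi.nii.gz"
    matrix_output = f"{out_prefix}_to_swi.mat"

    # N4 bias field correction with default parameters (3D image).
    n4_command = [
        "N4BiasFieldCorrection", "-d", "3",
        "-i", structural_image, "-o", n4_output,
    ]
    # Rigid-body registration (6 degrees of freedom) to the SWI image.
    flirt_command = [
        "flirt", "-in", n4_output, "-ref", swi_image,
        "-out", registered_output, "-omat", matrix_output, "-dof", "6",
    ]
    return [n4_command, flirt_command]


commands = preprocessing_commands("sub01_T1w.nii.gz", "sub01_SWI.nii.gz", "sub01_T1w")
```

Each list could then be passed to `subprocess.run(command, check=True)` on a machine where ANTs and FSL are installed.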

Manual Annotation
Manual annotation was performed according to a protocol developed with the focus on highly specific differential detection of CMBs and non-hemorrhage iron deposits based on multiple modalities including QSM. The detailed protocol is described in Section 2 in the supplementary materials, and a flowchart of the manual annotation process is shown in Supplementary Figure 4 in the supplementary materials. Panel A in Figure 4 shows an example of a CMB in the thalamus and non-hemorrhage iron deposits in the interior section of the globus pallidus on SWI (for TE=7.5 ms and 22.5 ms), QSM and T2w MRI, and Panel B shows the expert segmentation of the lesions based on the annotation protocol. Panel C shows an example of a larger CMB located in the occipital lobe and Panel D shows its respective expert segmentation.

Study Participants
We included imaging data from participants in the MESA Exam 6 Atrial Fibrillation Ancillary Study 31-33 . This study was approved by the Institutional Review Boards at the MESA Coordinating Center and at each participating institution. Written informed consent was obtained from all participants. All participant data collection was performed in accordance with relevant guidelines and regulations.
A subset of the MESA cohort participated in an ancillary study of cardiac arrhythmias and brain imaging during the 2016-2018 exam (Exam 6) 34 . From 1061 participants who underwent MR brain scans, we selected a convenience sample of 34 scans based on visual identification of possible CMBs by two experienced readers (IMN and TR). These 34 participants are not representative of the MESA cohort in terms of prevalence of CMBs and non-hemorrhage iron deposits, and additional participants in the MESA cohort likely have CMBs and/or non-hemorrhage iron deposits. A total of 10 participants' scans were excluded due to poor image quality (n=4) and the presence of distortions/artifacts or motion-related effects (n=6). The demographics summary and lesion loads for the 24 included participants are presented in Supplementary Table 2.1. Of these 24 participants, there were 13 males and 11 females with age range 65-94 years. Based on the expert annotation of these 24 participants, 4 participants had no microbleeds, 13 participants had 1 or 2 microbleeds (with an average size of 10.85 mm³), 6 participants had between 3 and 8 microbleeds (with an average size of 10.21 mm³) and 1 participant had more than 100 microbleeds (with an average size of 4.76 mm³). In certain circumstances, the participant with more than 100 microbleeds may be considered an outlier in terms of the number of CMBs. An examination of this is presented in Section 3 of the supplementary materials. Of the 24 participants, 5 participants did not have any voxels labeled as non-hemorrhage iron deposits and the remaining had between 2 (each having a single voxel or 1.5 mm³) and 13 lesions (one participant had 4 non-hemorrhage iron deposit lesions with a total of 326 voxels or 489 mm³) labeled as non-hemorrhage iron deposits in the basal ganglia.
The distribution of CMBs and iron deposits pooled over all participants is illustrated in Supplementary Figure 5. The average size (± SEM, or standard error of the mean) of CMB lesions in this sample was 6.27 ± 0.51 mm³ (4.18 ± 0.34 voxels). Among the 20 participants with CMBs, 70% (n = 14) had two or fewer CMBs, 25% (n = 5) had between three and eight CMBs, and the remaining participant had 120 CMBs. The average size of non-hemorrhage iron deposit labels (± SEM) was 26.15 ± 4.76 mm³ (17.43 ± 3.17 voxels). Approximately 21% (n = 5) had no discernible basal ganglia non-hemorrhage iron deposits and half (n = 12) had fewer than 100 voxels (150 mm³) labeled as non-hemorrhage iron deposits. The remaining 29% (n = 7) had more than 100 voxels labeled as non-hemorrhage iron deposits.

Method Overview for Automated Processing
We developed a deep learning framework for automatic segmentation of CMBs and non-hemorrhage iron deposits based on the U-Net 35,55 , a widely used deep learning architecture for image segmentation. Our architecture, however, employed padded instead of unpadded convolutions and operated on six instead of five spatial resolutions, and was used for both single class and multiclass segmentation experiments. The larger number of resolution layers enabled the model to detect small CMBs. A detailed description of our implementation is presented in the following sections. The overall system pipeline is shown in Figure 5. After the initial step of coregistration, the MR volumes were preprocessed to have zero mean and unit variance, as detailed in Section 4.6. The normalized MR volumes were then sliced along the z-axis (axial slices) and edge-padded to obtain 2D slices with 256x256 voxels. We evaluated the performance using leave-one-out cross-validation for the 24 participants listed in Supplementary Table 2.1 to ensure generalization of results. In each fold, a single participant's data was kept separate for testing (test dataset), and the MR data and labels from the remaining 23 participants were randomly split into a training dataset (75%, consisting of 17 participants) and a validation dataset (25%, consisting of 6 participants). Both training and validation datasets were augmented to improve the robustness of the deep learning models (for more details on data augmentation see Section 4.7). The training dataset was used to train the model for a single epoch, after which the validation dataset was used to compute a commonly used evaluation metric known as intersection-over-union (IoU), which quantifies the amount of overlap between the predicted and ground truth segmentations. Each model was trained for a maximum of 30 epochs, and the best model was determined as the model with the maximum IoU. This best model was then used to predict the labels of the test dataset.
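As an illustration of the model selection criterion, the following sketch computes the IoU between two binary masks and picks the epoch with the highest validation IoU. The function names are illustrative, and the handling of two empty masks (returning 1.0) is an assumption, since the text does not describe that case.

```python
import numpy as np


def intersection_over_union(predicted_mask, reference_mask):
    """IoU between two binary masks of the same shape."""
    predicted = np.asarray(predicted_mask, dtype=bool)
    reference = np.asarray(reference_mask, dtype=bool)
    union = np.logical_or(predicted, reference).sum()
    if union == 0:
        # Assumption: two empty masks are treated as perfect agreement.
        return 1.0
    return float(np.logical_and(predicted, reference).sum() / union)


def best_epoch(validation_ious):
    """Index of the epoch with the maximum validation IoU."""
    return int(np.argmax(validation_ious))


predicted = [[1, 1], [0, 0]]
reference = [[1, 0], [1, 0]]
example_iou = intersection_over_union(predicted, reference)  # 1 shared voxel, 3 in the union
selected = best_epoch([0.20, 0.50, 0.40])
```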
The set of predictions used for evaluating model performance thus consisted of 24 segmentation masks that were predicted with 24 different models with no overlap between training, validation and testing datasets. These cross-validated evaluations were done for both single class and multiclass experiments. For all experiments, four combinations of MR modalities were considered: (1) SWI only, (2) SWI and QSM, (3) SWI and T2w, and (4) SWI, QSM and T2w.
For single class experiments, separate models were trained and evaluated for (1) CMBs only and (2) non-hemorrhage iron deposits only. For multiclass experiments, both CMBs and iron deposits had separate labels and were segmented simultaneously. For multiclass segmentations, a larger number of augmentations were used than for single class segmentations.
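The fold construction described in this section can be sketched as below. It is a minimal illustration under stated assumptions: participants are identified by integer indices, the split is drawn with a seeded standard-library random generator, and the validation size is obtained by rounding 25% of the 23 remaining participants, which gives the 6/17 split reported above.

```python
import random

# The four modality combinations evaluated in every experiment.
MODALITY_SETS = [
    ("SWI",),
    ("SWI", "QSM"),
    ("SWI", "T2w"),
    ("SWI", "QSM", "T2w"),
]


def leave_one_out_folds(participants, validation_fraction=0.25, seed=0):
    """Yield one train/validation/test split per held-out participant."""
    rng = random.Random(seed)
    for test_participant in participants:
        remaining = [p for p in participants if p != test_participant]
        rng.shuffle(remaining)
        n_validation = round(len(remaining) * validation_fraction)
        yield {
            "test": [test_participant],
            "validation": remaining[:n_validation],
            "train": remaining[n_validation:],
        }


folds = list(leave_one_out_folds(list(range(24))))
```

Each fold would then be trained once per entry in `MODALITY_SETS`, so that every modality combination is evaluated on the same 24 held-out participants.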

2D U-Net with Padded Convolutions
Our lesion prediction models are based on the U-Net 35 . Both single and multiclass models consist of an analysis path (down-sampling operations) with five stages of convolution blocks and pooling, followed by a synthesis path (up-sampling) with five stages of up-convolutions, plus a convolutional block. Each downsampling block consists of two 2D padded convolution layers with a kernel size of 3x3 and a stride of 1x1, followed by Batch Normalization and ReLU activation. The downsampling block ends with a 2x2 max pooling layer which halves the resolution of the feature map in every spatial direction. The central block consists of two instances of padded 2D convolution with kernel size 3x3 and stride 1x1, followed by Batch Normalization and ReLU activation. Each upsampling block passes its input data through a 2D transpose convolution with a kernel size of 2x2 and a stride of 2x2 in order to double the size of the feature map. This doubled feature map is then concatenated with the feature map (same size) of the corresponding analysis stage (i.e. the feature map before the max pooling layer), followed by two instances of a padded 2D convolution layer with kernel size 3x3 and stride 1x1, followed by Batch Normalization and ReLU activation. Due to the use of padded convolutions throughout the model, the input and output image sizes are the same (256x256). The smallest downsampled image size is 8x8 in the central convolution block.
In the case of the single class prediction model, the output of the final upsampling stage passes through a 2D convolution layer with kernel size 1x1, stride 1x1 and Sigmoid activation function. For the multiclass prediction model, the output of the final upsampling block is passed through a 2D convolution layer with kernel size 1x1, stride 1x1 and ReLU activation function, and then through a SoftMax layer to generate class probabilities. The model architecture is depicted in Figure 6. We employed random translations, random rotation, and flipping along the left-right axis during training. The network was trained with the cross-entropy loss.
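The spatial sizes implied by this architecture can be traced with a small helper. This is only a bookkeeping sketch that follows the description above (padded convolutions preserve size, each 2x2 max pooling halves it, each transpose convolution doubles it); it does not build the network, and the function name is illustrative.

```python
def unet_feature_map_sizes(input_size=256, n_downsampling_stages=5):
    """Side length of the feature map at every stage of the network.

    Returns (analysis sizes, central block size, synthesis sizes).
    """
    analysis = [input_size // 2 ** stage for stage in range(n_downsampling_stages)]
    central = input_size // 2 ** n_downsampling_stages
    # Each synthesis stage doubles the size and is concatenated with the
    # analysis feature map of the same size.
    synthesis = analysis[::-1]
    return analysis, central, synthesis


analysis_sizes, central_size, synthesis_sizes = unet_feature_map_sizes()
# Five pooling stages give six resolutions: 256, 128, 64, 32, 16 and 8.
shallower_central = unet_feature_map_sizes(n_downsampling_stages=4)[1]
```

With one fewer pooling stage, the central block operates at 16x16 instead of 8x8, which corresponds to the five-resolution variant that the six-resolution model is compared against.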

Image Preprocessing
Each input image was normalized to have zero mean and unit variance. For QSM images, an additional prior step truncated the overall intensity such that the voxel value $V_{QSM}$ was within the range $-k\sigma \le V_{QSM} \le k\sigma$, where $k = 5$ and $\sigma$ is the standard deviation of the QSM image. This step is necessary because QSM images contain high-intensity noise (especially around the boundary of the brain and the region proximate to the sinus cavity) which may deemphasize the intensity of the rest of the brain.
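A minimal NumPy sketch of these two steps is given below. It assumes that the standard deviation and mean are computed over the whole input array; whether the statistics were restricted to the brain mask is not stated in the text, and the function names are illustrative.

```python
import numpy as np


def truncate_qsm(qsm, k=5.0):
    """Clip QSM values to [-k * sigma, k * sigma], sigma being the image SD."""
    sigma = qsm.std()
    return np.clip(qsm, -k * sigma, k * sigma)


def normalize(image):
    """Rescale an image to zero mean and unit variance."""
    return (image - image.mean()) / image.std()


# Synthetic volume with one extreme outlier standing in for boundary noise.
rng = np.random.default_rng(0)
volume = rng.normal(0.0, 0.05, size=(8, 8, 8))
volume[0, 0, 0] = 10.0
limit = 5.0 * volume.std()

truncated = truncate_qsm(volume)
normalized = normalize(truncated)
```

Without the truncation, a few extreme voxels would inflate the standard deviation used in the normalization and compress the contrast of the remaining brain tissue.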

Data Augmentation
To improve the robustness of the deep learning network and include more training data, we enriched the training and validation datasets with augmentation. Axial slices containing CMBs and iron deposits are, for the most part, few compared to the remaining slices in a given brain volume. This type of class imbalance may bias the training process. To address this, data augmentation was performed on slices selectively instead of on all slices, inspired by the concept of random over-sampling (ROS) and random under-sampling (RUS) 56 . First, all slices containing the labels of interest (i.e. CMBs and/or iron deposits) are augmented. Then a number of the remaining slices are randomly selected and augmented in the same manner until the total number of slices containing the labels of interest and the total number of slices that do not contain any labels of interest are similar.
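The selective choice of slices can be sketched as follows. This is an illustration under an explicit assumption: "similar" is interpreted as drawing as many unlabeled slices as there are labeled ones, which the text does not specify exactly. The function name and the boolean-flag input are hypothetical.

```python
import random


def select_slices_for_augmentation(slice_has_label, seed=0):
    """Pick all labeled slices plus an equal-sized random set of unlabeled ones.

    `slice_has_label` holds one boolean per axial slice of a volume.
    """
    rng = random.Random(seed)
    labeled = [i for i, has_label in enumerate(slice_has_label) if has_label]
    unlabeled = [i for i, has_label in enumerate(slice_has_label) if not has_label]
    n_unlabeled = min(len(labeled), len(unlabeled))
    return labeled, sorted(rng.sample(unlabeled, n_unlabeled))


# A 20-slice volume in which only slices 3, 7 and 8 contain lesions.
flags = [False] * 20
for index in (3, 7, 8):
    flags[index] = True

labeled_slices, sampled_unlabeled_slices = select_slices_for_augmentation(flags)
```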
Data augmentation consisted of geometric transforms such as translations, rotations and image mirroring. In each experiment, the axial SWI slice (along with the corresponding axial QSM and T2w slices) and corresponding axial reference annotation slice were augmented. For translations, a set of two random integers tx and ty (representing the amount of shift per axis) were generated within the range [-45, 45] and used to translate the image slice(s) and the corresponding slice of the reference annotation. This range was chosen empirically so that most of the brain would be visible in the translated image. A total of 10 random integers per axis were generated for multiclass experiments.
For rotations, a set of random integers d (representing the rotation in degrees) was generated within the range [1, 60], and the image slice(s) and the slices with reference annotations were rotated using both +d and -d. The regions of the rotated slices that fell outside the image matrix were padded with edge values. A total of 16 random integers were used for multiclass experiments.
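The translation step can be sketched with NumPy as below. The sketch assumes that exposed pixels are filled with edge values, which the text states only for rotations, and it applies the same shift to every modality and to the annotation. Rotation is omitted to keep the example dependency-free. All function names are hypothetical.

```python
import numpy as np

def translate_edge(image, tx, ty):
    """Shift a 2D slice by tx columns and ty rows, filling exposed pixels with edge values."""
    h, w = image.shape
    pad_y, pad_x = abs(ty), abs(tx)
    padded = np.pad(image, ((pad_y, pad_y), (pad_x, pad_x)), mode="edge")
    y0, x0 = pad_y - ty, pad_x - tx
    return padded[y0:y0 + h, x0:x0 + w]

def random_translation(images, annotation, rng, max_shift=45):
    """Apply one random shift in [-max_shift, max_shift] to all slices and the annotation."""
    tx, ty = (int(v) for v in rng.integers(-max_shift, max_shift + 1, size=2))
    moved = [translate_edge(img, tx, ty) for img in images]
    return moved, translate_edge(annotation, tx, ty), (tx, ty)
```

Using one shared shift per sample keeps the SWI, QSM and T2w slices aligned with each other and with the reference annotation.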

Evaluation of Performance
In single class models, the segmentation output map was in the range [0, 1]. Segmentations were accepted or rejected by applying a threshold value of 0.5 to the output map. In multiclass models, the model output was passed through a SoftMax function and segmentation labels were determined based on the class having the highest probability.
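A minimal sketch of the two decision rules, assuming NumPy arrays with the class axis first. Whether a value of exactly 0.5 counts as positive is not stated, so the strict comparison is an assumption, as are the function names.

```python
import numpy as np

def single_class_labels(prob_map, threshold=0.5):
    """Binary mask from an output map in [0, 1]."""
    return (prob_map > threshold).astype(np.uint8)

def multiclass_labels(class_probs):
    """class_probs: (n_classes, H, W) softmax output -> (H, W) map of the most probable class."""
    return np.argmax(class_probs, axis=0).astype(np.uint8)
```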
We evaluated the performance in terms of the rate of detected/missed CMBs and non-hemorrhage iron deposit lesions. For each participant, the numbers of true positives (TP), false positives (FP) and false negatives (FN) were counted. A connected-component filter with 3D connectivity was applied to both the predicted segmentation and the reference segmentation in order to identify clusters of voxels. The centroid of each lesion in both the predicted segmentation and the reference annotation was computed. TP, FP and FN were determined by whether the Euclidean distance between the centroid of each predicted lesion and that of a reference lesion was below a specified tolerance. Since CMBs are generally assumed to be relatively small, a tolerance of 3 was used for evaluating CMBs, and a tolerance of 5 was used for evaluating non-hemorrhage iron deposits, which are larger and more dispersed than the spherical CMBs. The sensitivity (or true positive rate) was computed as the ratio of TP to the number of lesions in the ground truth (TP + FN) for each participant:
$$\text{Sensitivity} = \frac{TP}{TP + FN}$$
The precision (or positive predictive value) was computed as the ratio of TP to the number of lesions in the predicted mask (TP + FP):
$$\text{Precision} = \frac{TP}{TP + FP}$$
When the number of true negatives (TN) is available, the typical measure of performance is the overall accuracy, determined by
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
To evaluate the performance of each model, we report the average sensitivity across all participants and the average precision across all participants, as well as a combined metric (magnitude accuracy) computed as $\sqrt{\bar{S}^2 + \bar{P}^2}$, where $\bar{S}$ and $\bar{P}$ are the average sensitivity and precision, respectively.
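The lesion-level counting can be sketched as follows, starting from centroids that have already been extracted by connected-component labeling. The text does not say how ties are resolved, so the sketch assumes a greedy one-to-one matching (each reference lesion claims its nearest unmatched prediction within the tolerance), which keeps TP + FP equal to the number of predicted lesions. Function names and the NaN convention for empty denominators are hypothetical.

```python
import math

def count_detections(pred_centroids, ref_centroids, tolerance):
    """Return (TP, FP, FN) using greedy one-to-one centroid matching."""
    unmatched = list(pred_centroids)
    tp = 0
    for ref in ref_centroids:
        candidates = [p for p in unmatched if math.dist(p, ref) < tolerance]
        if candidates:
            unmatched.remove(min(candidates, key=lambda p: math.dist(p, ref)))
            tp += 1
    return tp, len(unmatched), len(ref_centroids) - tp

def sensitivity(tp, fn):
    """TP / (TP + FN); NaN when the participant has no reference lesions (assumption)."""
    return tp / (tp + fn) if tp + fn else float("nan")

def precision(tp, fp):
    """TP / (TP + FP); NaN when nothing was predicted (assumption)."""
    return tp / (tp + fp) if tp + fp else float("nan")

def magnitude_accuracy(mean_sensitivity, mean_precision):
    """Combined metric: Euclidean norm of the average sensitivity and precision."""
    return math.hypot(mean_sensitivity, mean_precision)
```

Because the combined metric is a norm of two values in [0, 1], it ranges from 0 to √2 rather than from 0 to 1.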

Statistical Analysis
Due to the small sample size and potentially non-uniform distribution of the models' sensitivity, precision and magnitude accuracy, we utilized the non-parametric two-tailed Wilcoxon signed rank test 57 to check for any difference between the performance of the various models. In all experimental evaluations, the model trained with only SWI was considered as the baseline model for comparison. Statistical significance was considered at p < 0.05. Pearson's correlation between the prediction and the reference annotation was also calculated. For CMBs, the correlation was calculated using the number of lesions, and for non-hemorrhage iron deposits, the volume was used. All statistical analyses were performed in MATLAB R2017b.

Overview of the split for one fold of the cross-validation process, which is repeated n times. In each fold, the model that was used to predict the test participant was trained on the remaining n-1 participants in order to avoid data leakage. Within the training stage, 25 percent of the n-1 participants were used as the validation set. The model with the highest validation accuracy was chosen to predict the left-out participant.
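The leave-one-participant-out split can be sketched as below. How the validation participants were drawn and how 25 percent of n-1 was rounded are not stated, so random selection and rounding to the nearest integer are assumptions, as are the function name and the participant identifiers used to call it.

```python
import random

def leave_one_out_splits(participants, val_fraction=0.25, seed=0):
    """Yield one train/validation/test split per participant."""
    rng = random.Random(seed)
    for test_id in participants:
        # The held-out participant never appears in training or validation.
        rest = [p for p in participants if p != test_id]
        rng.shuffle(rest)
        n_val = round(val_fraction * len(rest))
        yield {"test": test_id, "val": rest[:n_val], "train": rest[n_val:]}
```

With 24 participants this gives 24 folds, each holding out one participant and, under the rounding assumed here, using 6 of the remaining 23 for validation.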