Deep learning with diffusion MRI as in vivo microscope reveals sex-related differences in human white matter microstructure

Biological sex is a crucial variable in neuroscience studies, where sex differences have been documented across cognitive functions and neuropsychiatric disorders. While gross statistical differences have previously been documented in macroscopic brain structure, such as cortical thickness or region size, less is understood about sex-related cellular-level microstructural differences, which could provide insight into brain health and disease. Studying these microstructural differences between men and women paves the way for understanding brain disorders and diseases that manifest differently in the two sexes. Diffusion MRI is an important in vivo, non-invasive methodology that provides a window into brain tissue microstructure. Our study develops multiple end-to-end classification models that accurately estimate the sex of a subject from volumetric diffusion MRI data and uses these models to identify the white matter regions that differ most between men and women. 471 male and 560 female healthy subjects (age range, 22–37 years) from the Human Connectome Project are included. Fractional anisotropy, mean diffusivity and mean kurtosis are used to capture brain tissue microstructure characteristics. Diffusion parametric maps are registered to a standard template to reduce bias that can arise from macroscopic anatomical differences such as brain size and contour. This study employs three major model architectures: 2D convolutional neural networks, 3D convolutional neural networks and a vision transformer (with self-supervised pretraining). Our results show that all three models achieve high sex classification performance (test AUC 0.92–0.98) across all diffusion metrics, indicating definitive differences in white matter tissue microstructure between males and females. We further use the complementary model architectures to characterize the pattern of detected microstructural differences and the influence of short-range versus long-range interactions.
Occlusion analysis, together with the Wilcoxon signed-rank test, is used to determine which white matter regions contribute most to sex classification. The results indicate that sex-related differences manifest both in local features and in global features / longer-distance interactions of tissue microstructure. Our highly consistent findings across models provide new insight supporting differences between male and female brain cellular-level tissue organization, particularly in the central white matter.


Study population
The study includes 1031 healthy adult subjects (age range, 22–37 years) from the Human Connectome Project (HCP-Young dataset)38, whereby sex labels were collected through self-reporting and no subject was found to have a self-reported sex differing from genetic sex. Institutional review board approval and participants' informed consent were obtained at the participating institutions. Demographic details are summarized in Table 1.

Diffusion MRI
Diffusion MR images were collected on a 3T scanner (Connectome Skyra, Siemens Medical Solutions, Erlangen, Germany) and preprocessed as per the HCP protocol38,39. In brief, diffusion imaging was performed with the following parameters: 3 b-values (1000, 2000, 3000 s/mm2), 90 diffusion orientations per shell, 18 b0 (b-value = 0) images, 1.25 mm isotropic image resolution, field of view = 210 mm, number of slices = 111, TR/TE = 5520/89.5 ms; each scan was repeated along 2 phase encoding directions (RL/LR), and details can be found in the HCP dataset documentation38. The diffusion data was preprocessed by HCP to correct for artifacts such as motion and eddy-current artifacts39. We use an in-house image processing tool to generate diffusion metrics40, including fractional anisotropy (FA), mean diffusivity (MD) and mean kurtosis (MK), to assess white matter microstructure. FA and MD are included because they are the two most commonly used diffusion metrics for characterization of tissue microstructure41. Of note, FA measures the directionality of water movement in brain tissue and is known to be sensitive to microstructures such as axons and myelin42, while MD measures mean water diffusivity and is sensitive to characteristics such as cellularity43. Here, we also include MK, derived from diffusion kurtosis imaging (DKI), to compactly represent the non-Gaussian behavior of water molecules as a measure of tissue complexity44. All metrics are registered to the FA template in MNI space45 using the FMRIB Software Library (FSL)46 so as to remove the effects of any macroscopic anatomical differences, such as the size and contour of the brain itself.
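For readers unfamiliar with these metrics, the following is a minimal sketch of the standard tensor-derived definitions of FA and MD from the three eigenvalues of a fitted diffusion tensor. This is purely illustrative and is not the in-house tool used in the study; MK requires an additional kurtosis-tensor fit and is omitted here.

```python
import math

def fa_md_from_eigenvalues(evals):
    """Fractional anisotropy (FA) and mean diffusivity (MD) from the
    three eigenvalues of a fitted diffusion tensor (standard tensor
    definitions; an MK estimate would need a kurtosis fit)."""
    l1, l2, l3 = (float(v) for v in evals)
    md = (l1 + l2 + l3) / 3.0  # mean diffusivity = mean eigenvalue
    num = (l1 - md) ** 2 + (l2 - md) ** 2 + (l3 - md) ** 2
    den = l1 ** 2 + l2 ** 2 + l3 ** 2
    fa = math.sqrt(1.5 * num / den) if den > 0 else 0.0
    return fa, md
```

Perfectly isotropic diffusion (equal eigenvalues) yields FA = 0, while fully anisotropic diffusion along a single axis yields FA = 1.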

End-to-end classification models
This study employs three major model architectures: a 2D convolutional neural network (CNN)47, a 3D CNN48-50, and a 3D vision transformer (ViT)51,52. We choose these end-to-end deep networks, which act on the entire image volume, to avoid any reliance on hand-crafted features and/or complicated feature engineering. In general, CNNs and ViTs show state-of-the-art performance broadly across image classification tasks. The two architectures have their own strengths and may be complementary: a CNN has inductive biases by design, such as locality and translation equivariance/invariance (with/without pooling), making it generally more sample-efficient and, in theory, better suited to capturing local features of an image or volume53. While the ViT lacks the inductive bias of convolutional layers, rendering it somewhat more data-hungry, it has strengths that CNNs lack in being able to capture long-range interactions and more global features present in an image or imaging volume51,54, which could be important for capturing potential sex differences that exist as long-range interactions. Although a 3D CNN may be the intuitive choice of architecture for a 3D imaging volume, it requires more parameters and more training samples than a 2D CNN. Thus, we also test the performance of a 2D CNN with a lighter feature extraction backbone and greater training efficiency.

2D convolutional neural network
In this work, we use a ResNet1847 as the 2D CNN backbone for feature extraction. Here, the 2D network essentially receives input from a small 3-slice subvolume, as ResNet18 is designed to receive color images with 3 channels (RGB). We extract features from every 3 consecutive slices and combine features from all non-overlapping 3-slice subvolumes in the prediction head for classification (Fig. 1). Specifically, given input volumetric data with shape S × H × W (S: slice number; H × W: slice size, with each slice in sagittal view), we generate S/3 2D 3-channel images, each with shape 3 × H × W. The same ResNet18 is applied to extract features from each 3-slice subvolume, and features from all S/3 subvolumes are concatenated as the input to a linear prediction head. The ResNet18 architecture is shown at the bottom of Fig. 1. The input is fed to a convolutional layer (conv layer) (kernel-size = 7 × 7, stride = 2, channel-number or number-of-feature-maps = 64), followed by a max-pooling layer for further downsampling (kernel-size = 3 × 3, stride = 2). After the pooling, 8 convolutional layer blocks called residual blocks (where the input to the block is added to the output via a residual short-cut connection) are applied; each block contains 2 convolutional layers with kernel-size = 3 × 3, and the channel number is doubled and the spatial size downsampled by 2 at the first conv layer of the 3rd, 5th and 7th residual blocks. Each conv layer is followed by batch normalization55 and ReLU activation56. At the end of ResNet18, global average pooling is applied to each feature map to generate a single feature value, yielding 512 features for each 3-slice subvolume. Given S × H × W = 180 × 224 × 224, we have S/3 = 60 3-slice subvolumes, each yielding 512 features. These 60 × 512 features are concatenated and fed to a linear layer for the final prediction: a fully-connected layer mapping the 60 × 512 features to the predicted class score.
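The subvolume-splitting and feature-concatenation data flow described above can be sketched in a few lines of NumPy. The `backbone` callable here is a stand-in for the shared, trained ResNet18 (the function names and the stub are ours, for illustration only):

```python
import numpy as np

FEAT_DIM = 512  # feature size produced by the shared ResNet18 backbone

def split_into_subvolumes(volume):
    """Reshape an S×H×W volume into S/3 non-overlapping 3-slice
    subvolumes of shape 3×H×W (S must be divisible by 3)."""
    s, h, w = volume.shape
    assert s % 3 == 0, "slice count must be divisible by 3"
    return volume.reshape(s // 3, 3, h, w)

def classify_volume(volume, backbone, head_weights):
    """Apply the shared backbone to every subvolume, concatenate the
    (S/3)*FEAT_DIM features, and map them to a class score with a
    single linear prediction head."""
    feats = np.concatenate([backbone(sv) for sv in split_into_subvolumes(volume)])
    return float(feats @ head_weights)
```

With S = 180 this produces 60 subvolumes, hence a 60 × 512 concatenated feature vector, exactly as described in the text.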

3D convolutional neural network
We employ 3D ResNet-1048-50 as our 3D CNN backbone, with the architecture shown in Fig. 2. The 3D volume is first fed into a conv layer (kernel-size = 7 × 7 × 7, stride = 2, channel = 64) followed by a max-pooling layer (kernel-size = 3 × 3 × 3, stride = 2); 8 residual blocks are then used, each having 1 conv layer. The channel number is doubled at residual blocks 3, 5 and 7, with stride set to 2 for block 3, and dilation set to 2 for block 5 and to 4 for block 7. Each conv layer is followed by group normalization57 and ReLU activation56. At the end, global average pooling maps the 512 feature maps to 512 feature values, and one linear layer is used for the final prediction: a fully-connected layer mapping the 512 features to the predicted class score.
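The use of dilation instead of striding in blocks 5 and 7 enlarges the receptive field without further reducing the spatial resolution of the feature maps. A tiny helper (ours, for illustration) makes the effect concrete:

```python
def effective_kernel_size(k, dilation):
    """Spatial extent covered by a k-wide convolution kernel with the
    given dilation: dilation inserts (dilation - 1) gaps between kernel
    taps, so the kernel spans k + (k - 1) * (dilation - 1) voxels along
    each axis while keeping the same number of parameters."""
    return k + (k - 1) * (dilation - 1)
```

For the 3 × 3 × 3 kernels used here, dilation 2 (block 5) covers a 5-voxel extent per axis and dilation 4 (block 7) a 9-voxel extent, without any additional downsampling.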

Vision transformer for 3D input pretrained with mask autoencoders
The original 2D ViT51 is extended to extract features from a 3D volume. Given an input 3D diffusion metric x ∈ R^(S×H×W), the data is reshaped into a sequence of flattened non-overlapping 3D patches x_p ∈ R^(N×(s·h·w)), where (S, H, W) is the 3D volume size, (s, h, w) is the 3D patch size, and the patch number is defined as N = SHW/shw. As shown in Fig. 3, for each 3D patch, a linear layer is applied to map voxel values to a latent embedding with dimension D. A learnable positional embedding with the same dimension D, representing each token's location, is added to the original embedding. The resulting sequence of embeddings for all N patches is fed to the encoder, consisting of L alternating layers of multi-head attention and multi-layer perceptron (MLP) blocks.
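The 3D patchification and embedding steps can be sketched as follows; this is an illustrative NumPy reconstruction of the reshaping described above (function names are ours), not the study's implementation:

```python
import numpy as np

def patchify_3d(volume, patch_size):
    """Split an S×H×W volume into N = SHW/(s*h*w) flattened,
    non-overlapping 3D patches, returned as an (N, s*h*w) array."""
    S, H, W = volume.shape
    s, h, w = patch_size
    assert S % s == 0 and H % h == 0 and W % w == 0
    x = volume.reshape(S // s, s, H // h, h, W // w, w)
    x = x.transpose(0, 2, 4, 1, 3, 5)  # bring the three patch-grid axes first
    return x.reshape(-1, s * h * w)

def embed_patches(patches, w_embed, pos_embed, cls_token):
    """Linear patch embedding plus positional embedding, with the
    learnable classification token prepended to the token sequence."""
    tokens = patches @ w_embed + pos_embed        # (N, D)
    return np.vstack([cls_token[None, :], tokens])  # (N + 1, D)
```

Each of the N patches becomes one token of dimension D, and the prepended classification token gives the encoder an N + 1 token sequence.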
A classification token with dimension D is appended to the input embedding sequence, designed as a latent representation of the entire input sample. The output embedding of the classification token is then fed into a linear prediction head to generate a prediction. In our study, the ViT is pretrained with masked autoencoders (MAE)58, where a specific ratio of patches, defined as r, is randomly masked and a ViT encoder and auxiliary decoder are trained to predict the values of the r × N masked patches from the (1 − r) × N unmasked patches. After pretraining, the encoder is finetuned for the target sex classification task with all N patches fed into it. Since 3D patches are more difficult to predict than 2D patches (especially given the small number of available 3D volumes), we pretrain a 2D ViT encoder with MAE on 2D slices first and use the resulting weights to initialize our 3D ViT model for 3D patches.

Although a CNN powerfully captures local features within the volume, a recent study showed that even very deep CNNs with a high number of parameters still have only small effective receptive fields54, meaning they rely primarily on their ability to learn local features as opposed to longer-distance relationships. ViTs, on the other hand, capture global features more readily51, and the MAE pretraining task used here also heavily focuses the model on inter-patch correlations that span some distance across the volume. The study finds that both the 3D CNN and ViT models performed very well, suggesting that both short-distance and long-distance interactions show differences in sex-related patterns of white matter microstructure. The 2D CNN achieved the best overall performance for 2 of the 3 diffusion metrics studied. This is felt to be attributable to the fact that the 2D CNN incorporates a design that may allow it to simultaneously capture both local features and global interactions (across all slices), making it able to leverage both types of features in the classification task. Specifically, the ResNet18 extracts features from every group of 3 consecutive slices, allowing the model to learn within-slice features and short-range inter-slice features across the 3 slices; by then concatenating features across all 3-slice subvolumes (as opposed to averaging across them, as is most commonly done), the model effectively preserves local features from every 3-slice partition while the prediction head learns more global interactions across subvolumes. It is also possible that the simplicity of the 2D CNN model (it had the lowest number of parameters, as shown in Table 4) helps its generalization capability, although differences in test performance were nominal across all three models, suggesting that they are all comparably generalizable.
The occlusion results show general consistency across models and across diffusion metrics and implicate central white matter tracts and ventral/dorsal hindbrain tracts as contributing to sex-related differences, though results differ slightly across diffusion metrics and models. The number and fractional volume of WM regions significantly contributing to sex classification were highest for the 2D CNN (mean number of regions: 15; mean fractional volume: 0.79) compared with the 3D CNN (mean number of regions: 2; mean fractional volume: 0.24) and the ViT (mean number of regions: 7; mean fractional volume: 0.16), possibly again reflecting differences in the relative facility of these models to tap short-range interactions, long-range interactions, or both. Across the three diffusion metrics, the 3D CNN classifier focused consistently on large central white matter structures such as the middle cerebellar peduncle and corpus callosum, whereas the ViT and 2D CNN models tended to rely on a greater diversity of white matter regions. Another observation is that the corpus callosum was found to be important across all neural network architecture types and diffusion metrics. As sex-related regional brain structure differences have been particularly controversial17-19, our work provides new evidence that sex differences do in fact exist in focal regions such as the corpus callosum.
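The core of the occlusion procedure can be sketched as follows. Here `predict_prob` stands in for any trained classifier returning the predicted probability of the correct class, and zero-filling is one common occlusion choice; the function name and stub are ours, and per-subject importance scores would then be compared against zero with the Wilcoxon signed-rank test (e.g., `scipy.stats.wilcoxon`):

```python
import numpy as np

def occlusion_importance(volume, region_mask, predict_prob):
    """Importance of an atlas region = drop in the model's predicted
    probability for the correct class after the region is occluded
    (zeroed out). Larger drops indicate regions the classifier relies
    on more heavily."""
    baseline = predict_prob(volume)
    occluded = volume.copy()
    occluded[region_mask] = 0.0
    return baseline - predict_prob(occluded)
```

Repeating this for every region in the white matter atlas, for every subject, yields the per-region importance distributions summarized above.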
Of note for the ViT, pre-training with MAE was important.ViT is a data-hungry architecture and difficult to train with a limited dataset since it lacks inductive bias such as the locality and translation invariance of CNNs.The MAE pretraining task (to predict masked patches from visible patches) enables the model to learn interpatch interactions without supervision from data labels.The random masking itself also introduces data diversity to the pretraining, which helps further improve the generalizability of learned features.The benefit of MAE pretraining is clearly demonstrated in the experimental results: without pretraining, ViT trained from scratch yielded much lower performance with test AUC < 0.80.The improvement of each of the other models compared to ViT trained from scratch is statistically significant, with p < 0.05 achieved with the Wilcoxon signed-rank test that compares the predicted probability of the correct class.With MAE pretraining, the ViT encoder achieved test AUC 0.94-0.96.The end-to-end supervised finetuning brought no additional gain and achieved comparable performance with linear probing, confirming that the size of the training set is insufficient to tune a data-hungry ViT in supervised end-to-end training.
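The random-masking step at the heart of MAE pretraining is simple to sketch (an illustrative reconstruction, with names of our choosing):

```python
import numpy as np

def random_mask(n_patches, mask_ratio, rng):
    """Randomly choose round(r*N) patch indices to mask. During MAE
    pretraining the encoder sees only the visible patches, and the
    auxiliary decoder reconstructs the masked ones from the encoder's
    output tokens."""
    n_mask = int(round(mask_ratio * n_patches))
    perm = rng.permutation(n_patches)
    return perm[:n_mask], perm[n_mask:]  # (masked indices, visible indices)
```

Because a fresh permutation is drawn every epoch, each volume yields many distinct masked views, which is the source of the data diversity noted above.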
Limitations include the use of only three representative diffusion metrics, though these were chosen because they are the most common and are easily obtained using a well-established diffusion kurtosis imaging acquisition. Further exploration using modeled diffusion metrics26 may yield additional information about sex-related differences in tissue microstructure and help further characterize the underlying biophysical differences between the brains of males and females. Additionally, our study is based on DWI with moderate b-values; future studies could include datasets with high b-values that are more sensitive to restricted diffusion59. Recognizing that the age distribution differs between the female and male cohorts (with the female group containing more older participants) (Table 1), we separately evaluated model accuracy on three subgroups broken down by age and found the performance to be comparable across all age groups. For a narrow band of adults aged 26–30, where aging changes are likely to have little influence, our models continue to achieve high sex classification accuracy. In future work, combined with additional diffusion metrics such as the modeled diffusion metrics26, the study can be extended to examine sex differences in age ranges other than young adults, which could shed light on how sex differences progress over the life span. Finally, the occlusion analysis used the standard JHU-ICBM-1 mm atlas for white matter parcellation, which has sizable variation in region size and could potentially bias regional importance; any effect of region size on region importance appears limited, however, as the relative importance of regions does not correlate well with region size. In addition, our 2D CNN is based on sagittal slices; in future work, a new 2D CNN could be designed to leverage information from all three views. This study does not use the 3 diffusion metrics as combined input to the neural networks due to GPU memory constraints; neural network architectures that can efficiently take multiple 3D volumes as combined input can be explored in future work.
Overall, our results provide new evidence and insight supporting the existence of sex differences in human brain microstructure, both in local features (e.g., within central white matter structures such as the middle cerebellar peduncle and corpus callosum) and in global features (such as long-distance interactions). Capturing complex microstructural differences is challenging for conventional statistical methods or for a single machine learning model operating on handcrafted features. Our work demonstrates a unique approach: by leveraging multiple neural networks with completely different architectural designs, we capture complementary information and make our results independent of model architecture choices. When advanced machine learning architectures are data-hungry, self-supervised learning can be used to pretrain the models, enabling such neural networks to be leveraged for medical imaging studies that lack the enormous datasets available for non-medical computer vision work. This framework can be further adapted to study other neurological disorders.

Conclusion
This study provides new evidence of clear sex-related differences in the brain white matter microstructure of healthy young adults, detected using in vivo diffusion MRI without hand-crafting or manually manipulating the imaging data. We show this across 3 different end-to-end deep neural networks and 3 commonly used diffusion MRI metrics. Even after registering the diffusion MR volumes to a template so as to remove macroscopic anatomical differences such as overall brain size and contour, the results show that sex differences exist in the diffusion anisotropy (FA), mean diffusivity (MD) and tissue complexity (MK) of brain white matter. Our experiments further suggest that both local and longer-distance microstructural organizational features differ between the sexes. In particular, the central white matter appears specifically implicated. In addition, this study provides a framework for studying microstructural differences in the human brain using multiple deep neural network architectures, which helps capture complex microscopic features that are challenging for statistical methods. Further study is needed to determine whether and how these microstructural differences influence brain health and disease in both men and women.

Figure 1 .
Figure 1. Our 2D CNN model. In the top of the figure, the imaging volume is divided into subvolumes, and a shared ResNet18 is applied to extract 512 features from each subvolume. The features are concatenated and fed to a linear layer for the final prediction. The bottom of the figure shows the architecture of ResNet18 (residual connections, ReLU activations and batch normalization are omitted for simplicity): the input is first fed into a convolutional layer (7 × 7 kernel-size, stride = 2, channel-number = 64) followed by a max-pooling layer (kernel-size = 3 × 3, stride = 2); subsequently, 8 residual blocks are applied, each containing 2 convolutional layers. Residual block parameters: conv layers in blocks 1, 2 have kernel-size = 3 × 3 and channel = 64; conv layers in blocks 3, 4 have kernel-size = 3 × 3 and channel = 128; conv layers in blocks 5, 6 have kernel-size = 3 × 3 and channel = 256; conv layers in blocks 7, 8 have kernel-size = 3 × 3 and channel = 512; stride = 2 is applied at the first conv layer of blocks 3, 5, 7. Global average pooling is applied at the end. The classification head is a single fully-connected layer.

Figure 3 .
Figure 3. Vision transformer for diffusion MRI sex classification: the input imaging volume is partitioned into non-overlapping patches. Each patch is projected to a patch embedding using a linear patch-embedding layer, and a positional embedding representing the position of the patch is added. A classification token is appended to the sequence of tokens to learn a representation of the entire input sample. The structure of the transformer encoder is shown on the right; it consists of L alternating layers of multi-head attention and multi-layer perceptron (MLP) blocks. After the transformer encoder, the corresponding output of the classification token is fed to the classification head to generate the prediction results. In our study, a single fully-connected layer is used as the classification head. The pretraining of the ViT is shown in the lower part of the figure, with details of the pretraining included in the supplementary information. The ViT encoder from pretraining is used as the transformer-encoder backbone for the sex classifier shown in the upper part of the figure.

Table 4 .
Number of parameters for the three model architectures.