Deep joint learning of pathological region localization and Alzheimer’s disease diagnosis

The identification of Alzheimer’s disease (AD) using structural magnetic resonance imaging (sMRI) has been studied based on the subtle morphological changes in the brain. One of the typical approaches is a deep learning-based patch-level feature representation. For this approach, however, the predetermined patches before learning the diagnostic model can limit classification performance. To mitigate this problem, we propose the BrainBagNet with a position-based gate (PG), which applies position information of brain images represented through the 3D coordinates. Our proposed method represents the patch-level class evidence based on both MR scan and position information for image-level prediction. To validate the effectiveness of our proposed framework, we conducted comprehensive experiments comparing it with state-of-the-art methods, utilizing two publicly available datasets: the Alzheimer’s Disease Neuroimaging Initiative (ADNI) and the Australian Imaging, Biomarkers and Lifestyle (AIBL) dataset. Furthermore, our experimental results demonstrate that our proposed method outperforms the existing competing methods in terms of classification performance for both AD diagnosis and mild cognitive impairment conversion prediction tasks. In addition, we performed various analyses of the results from diverse perspectives to obtain further insights into the underlying mechanisms and strengths of our proposed framework. Based on the results of our experiments, we demonstrate that our proposed framework has the potential to advance deep-learning-based patch-level feature representation studies for AD diagnosis and MCI conversion prediction. In addition, our method provides valuable insights, such as interpretability, and the ability to capture subtle changes, into the underlying pathological processes of AD and MCI, benefiting both researchers and clinicians.


Related works
CNN-based Alzheimer's disease diagnosis. The development of deep learning methods, including CNN, has efficiently addressed multistep pipelines for handcrafted feature generation/extraction and logistic regression by training a model in an end-to-end manner 17 . Thus, studies on the accurate AD diagnosis based on the 3D CNN are underway by taking 3D whole-brain images as input 13,20 . In particular, various architectures have been proposed for accurate AD diagnosis 9,13,17,23,24 . Among all, 13 demonstrated the changes in disease identification performance according to various factors, such as the normalization layer, kernel size, network architecture width, and patient age. For network architecture, the model introduced in 13 was designed to be capable of learning subtle differences in the brain by avoiding early spatial downsampling and limiting the size reduction of feature maps during low-level feature extraction steps. Afterward, the attention mechanism gradually became popular and widely employed in the CNN-based image recognition model to better explain network behavior and generate more discriminative feature representations 19,43,44 .
In terms of AD diagnosis and analysis, 21,23 introduced an attention-based 3D convolutional network for disease identification and biomarker exploration. The network architecture was based on ResNet 45, and the attention module was embedded in the middle of the network. Taking the extracted local features as input, the attention module produced spatial weights. As described in 21, the goal of the attention module is to represent regional importance during end-to-end training. In backward propagation, the produced attention could work as a gradient filter. Recently, 31 devised a method that combines multiview-slice attention with a 3D CNN to analyze MRI data effectively. By focusing on multiple views of the brain slices and leveraging the 3D information, the model improves AD diagnosis performance.
Because the early stages of AD can only be identified from subtle local pathological cues, patch-level feature representations have been investigated to capture these changes more efficiently. However, since only a limited number of brain regions may contain relevant cues for disease identification, an additional discriminative patch extraction procedure was necessary before analysis. In previous studies, a common initial step for extracting discriminative patches involved leveraging prior anatomical knowledge and utilizing statistical approaches to derive the discriminative probability 33. Patch extraction has been improved through resampling schemes using Elastic Net 29,46. Recently, a landmark discovery algorithm for AD diagnosis was introduced for discriminative patch extraction 25. The algorithm started with a multivariate statistical test on training images generated using nonlinear registration. A p-value map was then obtained in the template space. Based on the p-value map, landmarks were determined according to the size and number of patches and the minimum distance between patches. Patches extracted at the landmarks were used for diagnosis model training. One 3D CNN-based feature extractor was configured per landmark to extract features from the patch located at that landmark. The subsequent fully connected layers used the concatenated patch-level representations for bag-level prediction.
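The landmark selection step described above (choosing locations from a p-value map subject to a patch count and a minimum distance) can be illustrated with a greedy rule. This is only a minimal sketch under our own assumptions and is not the algorithm of 25: the function name `select_landmarks`, the greedy ordering by ascending p-value, and the Euclidean distance in voxel units are all hypothetical choices.

```python
import numpy as np

def select_landmarks(p_map, num_landmarks, min_distance):
    """Greedily pick voxel locations with the smallest p-values.

    Assumption (not from the paper): a candidate is kept only if it lies at
    least `min_distance` voxels (Euclidean) from every landmark kept so far.
    """
    # Flat voxel indices ordered from the smallest to the largest p-value.
    order = np.argsort(p_map, axis=None)
    coords = np.stack(np.unravel_index(order, p_map.shape), axis=1)

    chosen = []
    for candidate in coords:
        far_enough = all(
            np.linalg.norm(candidate - kept) >= min_distance for kept in chosen
        )
        if far_enough:
            chosen.append(candidate)
            if len(chosen) == num_landmarks:
                break
    return np.array(chosen)

# Toy p-value map: two neighbouring significant voxels and one distant voxel.
toy_p_map = np.ones((5, 5, 5))
toy_p_map[1, 1, 1] = 0.001
toy_p_map[1, 1, 2] = 0.002   # too close to the first landmark
toy_p_map[4, 4, 4] = 0.003
landmarks = select_landmarks(toy_p_map, num_landmarks=2, min_distance=2.0)
```

In the toy map, the second most significant voxel is skipped because it is adjacent to the first, which is the behaviour a minimum-distance constraint is meant to enforce.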
However, previous studies conducted discriminative region localization and patch extraction independently of the diagnostic model, which may lead to suboptimal diagnostic performance 26. To alleviate this limitation, 26 proposed a hybrid loss and pruning strategy based on the H-FCN. The proposed model extracted multiscale feature representations (i.e., patch-, region-, and subject-level features) by employing a CNN. This hierarchical construction of the network architecture allows the trained model to identify the most informative patches and regions through the hybrid loss. The hybrid loss was designed by gathering patch-, region-, and image-level losses for diagnostic model training and for ranking the discriminative capacity of the corresponding locations. Pruning the less discriminative areas identified by the hybrid loss could improve diagnostic results. However, the hybrid loss was defined by considering all patch- and region-level features belonging to a patient's MRI images as positive samples, although not all patches and regions would necessarily be affected by AD. In addition, this approach still relies on predetermined landmarks for better classification performance in the initial stage, and the pruning approach cannot consider potential patch-level biomarkers not previously extracted as candidates.
More recently, a hybrid network (HybNet) 22 was proposed to improve the H-FCN, which considers both global and local structural information. Specifically, two branches were constructed: the global branch (GB) and the local branch (LB). The subject-specific and intersubject-consistent discriminative region localization approaches were applied in the GB and LB, respectively. Both localizations were obtained using a pre-trained fully convolutional network (FCN) backbone. The FCN backbone was trained for weakly supervised object localization (WSOL) and capable of generating class activation maps 47 . They utilized disease attention maps (DAMs) derived from the class activation map outputs to represent subject-specific brain regions associated with AD. Furthermore, the mean of the DAMs is calculated by averaging the DAMs generated from a substantial number of samples in the training dataset. Leveraging the pre-computed localization results, the GB and LB models were trained. The GB utilized DAMs as spatial attention, representing subject-specific discriminative brain regions. On the other hand, LB was trained using patches extracted from intersubject-consistent discriminative brain regions, as indicated by the mean DAM. The patch extraction was performed identically to the H-FCN, but the mean DAM was used instead of the discriminative probability map obtained by the statistical test. Moreover, this study indicated that the feature representations extracted from both branches could be complementary, and their fusion could improve classification performance. Although additional information was used to address the shortcomings of H-FCN, the predetermination of patches may hamper the effectiveness of end-to-end learning of local feature extraction and diagnosis. A recent study 30 introduces a patch-based deep multi-modal learning (PDMML) framework for diagnosing brain diseases. 
The authors incorporate a discriminative location discovery strategy to remove normal regions without prior knowledge. Moreover, the framework integrates information from different imaging modalities and jointly trains the model to preserve spatial information that would be lost by directly flattening the patches. As a result, the proposed model enhances the accuracy of Alzheimer's disease diagnosis.
AD-associated brain-region localization. AD-associated brain-region localization methods have been proposed and developed for various purposes. Specifically, these methods boost the classification performance of diagnostic models, detect potential biomarkers in AD diagnosis, and better explain the behavior of deep learning networks. These brain-region localization methods can be divided into two categories according to the information used in the localization: feature-based and position-based approaches.
First, feature-based approaches produce the brain-region localization result based on individual local features extracted from each brain image. These can be further divided into supervised learning and weakly supervised learning approaches based on the learning strategy. For supervised learning approaches, a relatively large patch or ROI extracted from the image was assigned the same annotation as the image-level annotation 15,48 . A model was trained to represent regional abnormalities by subject. Regional outcomes were used as features to estimate the individual disease states. Although the identified regional abnormalities can provide clinical evidence, the evidence was relatively coarse. In addition, this supervised approach assumes that all regions in patients are affected by AD. This fact has recently led to the application of weakly supervised learning and MIL.
Regarding weakly supervised learning approaches, 47 proposed a representative WSOL method using an FCN with a global average pooling (GAP) layer and a linear classifier. Owing to the linearity of the GAP and the linear classifier, areas that contributed significantly to the predictions can be tracked. This approach has evolved from several perspectives 18,49, and 50 proposed bag-of-local-feature models, which provide patch-level class evidence by limiting the receptive field size of the topmost feature maps. In AD studies, WSOL was employed in 26 to localize AD-related structural abnormalities at a finer scale by training an additional 3D FCN. Moreover, WSOL was applied to represent regional importance for better feature representation 22,51. Li et al. 51 proposed an iterative learning framework that leverages the localization result generated by WSOL. Furthermore, 22 introduced a subject-specific discriminative brain-region localization called a DAM. An attention mechanism can be attached to a diagnostic model for a similar purpose. Jin et al. 21 and Zhu et al. 52 proposed attention-based diagnosis models for joint learning of discriminative brain-region localization and disease identification.
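The linearity argument behind GAP-based localization can be checked numerically. The following is a minimal NumPy sketch with hypothetical tensor sizes and random values, not the implementation of 47: it shows that pooling and then classifying gives the same logits as classifying every location and then averaging, which is why per-location contributions can be traced.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: C feature channels on a D x H x W grid, K classes.
C, D, H, W, K = 8, 4, 5, 4, 2
features = rng.normal(size=(C, D, H, W))   # topmost feature maps
weight = rng.normal(size=(K, C))           # linear classifier weights
bias = rng.normal(size=K)

# Path 1: global average pooling, then the linear classifier.
pooled = features.mean(axis=(1, 2, 3))                 # shape (C,)
logits_gap = weight @ pooled + bias                    # shape (K,)

# Path 2: apply the classifier at every location, then average.
# class_map[k, d, h, w] is the evidence for class k at one location.
class_map = np.einsum("kc,cdhw->kdhw", weight, features)
logits_map = class_map.mean(axis=(1, 2, 3)) + bias

# Both paths agree up to floating-point error.
same_logits = np.allclose(logits_gap, logits_map)
```

When the receptive field of each location in `features` is limited, each entry of `class_map` depends on one patch only, which corresponds to the patch-level class evidence discussed for bag-of-local-feature models.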
Unlike feature-based region localization, position-based brain-region localization methods detect regions where significant differences appear between the AD and NC groups. The identified brain regions are consistent across subjects, leading to a method called inter-subject-consistent discriminative region localization 22. All sMRI scans are aligned to the same template in preprocessing; thus, all samples share the same 3D space. This shared space allows a group comparison of local features, and a statistical test can generate a probability map representing the discriminative capacity. In particular, this position-based localization method has been widely used in patch extraction for patch-level feature representation. Data-driven pathological brain-region localization approaches have continued evolving, as described in the previous section, through the statistical approach 25,28, landmark discovery 25, pruning strategy 26, and mean DAM 22. However, the existing patch extraction methods are performed independently of image-level diagnostic model outcomes.
Inspired by the recent patch-level analysis in AD diagnosis, we propose a framework that jointly learns pathological region localization and disease identification in an end-to-end manner. In addition, final decision-making is conducted through the transparent aggregation of the patch-level responses, providing patch-level class evidence for decision-making. To the best of our knowledge, this framework is the first for joint learning of position-based discriminative brain-region localization and disease identification in an end-to-end manner.

Experiments and results
In this section, we describe the datasets used for performance evaluation, our experimental settings, and the comparative methods related to patch-level feature extraction. In addition, we report the classification accuracies of our framework and those of the comparative methods.
Dataset and preprocessing. We used two publicly available datasets, namely the ADNI [1] and the AIBL [2]. The ADNI dataset is a renowned and extensively utilized resource in the field of Alzheimer's disease research. It provides longitudinal data from individuals diagnosed with AD, mild cognitive impairment (MCI), and normal controls (NC) and includes various types of data, such as clinical assessments, neuroimaging scans (i.e., MRI and positron emission tomography (PET)), genetic information, and biomarker measurements. The main objective of the ADNI dataset is to expedite the progress of developing novel diagnostic methods and therapeutic approaches for AD, ultimately contributing to improved patient care. The AIBL database is a comprehensive Australian research initiative focused on investigating the early biomarkers and underlying causes of AD. It consists of a variety of data collected from individuals across the cognitive spectrum, including clinical assessments, neuroimaging scans (MRI and PET), genetics, and lifestyle factors. The dataset provides valuable insight into the development and progression of AD and supports the development of early detection and intervention strategies, and potential treatment approaches. As our main objective is to classify MRI of patients with AD as well as MCI and NC, we consider only baseline subjects. Therefore, we first collected the baseline brain sMRI scans and the diagnostic information from the datasets. Henceforth, we categorized the disease state of scans collected in all datasets (i.e., ADNI and AIBL) into three classes: NC, MCI, and AD. In this process, we further divided each MCI subject into two classes for the MCI conversion prediction task. If the patient corresponding to a baseline image had not been diagnosed with an AD class by 72 months, the image was labeled as the stable MCI (sMCI) class. 
On the other hand, images of subjects who converted to the AD class within 36 months were labeled as the progressive MCI (pMCI) class. Note that MCI samples with reversion from the AD class to other classes were excluded from the dataset. The demographics and clinical information are presented in Table 1.
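The labeling rule for MCI subjects can be written as a small function. This is a minimal sketch of the rule as stated in the text, with several assumptions of ours: the function name and arguments are hypothetical, the 36-month bound is treated as inclusive, subjects converting between 36 and 72 months are left unlabeled because the text does not assign them a class, and follow-up length and the reversion exclusion are not modeled.

```python
from typing import Optional

def label_mci_subject(months_to_ad: Optional[float],
                      stable_horizon: float = 72.0,
                      conversion_window: float = 36.0) -> Optional[str]:
    """Return 'pMCI', 'sMCI', or None for a baseline MCI scan.

    months_to_ad: months from baseline to the first AD diagnosis,
    or None if no AD diagnosis was recorded (hypothetical encoding).
    """
    # Converted to AD within the conversion window: progressive MCI.
    if months_to_ad is not None and months_to_ad <= conversion_window:
        return "pMCI"
    # No AD diagnosis by the stable horizon: stable MCI.
    if months_to_ad is None or months_to_ad > stable_horizon:
        return "sMCI"
    # Assumption: conversions between the two bounds are not labeled here.
    return None

example_labels = [label_mci_subject(m) for m in (24, None, 48, 80)]
```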
The collected brain scans were processed using the following pipeline. First, brain extraction was performed with the HD-BET brain extraction tool 53 to remove non-brain tissues (e.g., neck and skull) from the MRI image. Then, the skull-stripped images were aligned to the MNI152 template using the linear registration tool (FLIRT) from the FMRIB Software Library v6.0.1 (http://fsl.fmrib.ox.ac.uk/fsl/fslwiki). This step removed global linear differences such as translation, scale, and rotation and gave all images an identical spatial resolution (i.e., 1 × 1 × 1 mm³). Consequently, we acquired preprocessed 3D brain scans of size 193 × 229 × 193. Note that each image was normalized using its own mean and standard deviation.
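The final normalization step can be sketched as a per-image z-score. This is an illustrative example only: the function name is hypothetical, the statistics are computed over all voxels including background (the text does not say whether a brain mask was used), and the small `eps` term is our addition to avoid division by zero.

```python
import numpy as np

def zscore_normalize(volume: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Normalize one volume with its own mean and standard deviation."""
    volume = volume.astype(np.float32)
    # Assumption: statistics are taken over every voxel of the volume.
    return (volume - volume.mean()) / (volume.std() + eps)

# Small random stand-in for a registered scan (real scans are 193 x 229 x 193).
rng = np.random.default_rng(0)
toy_scan = rng.normal(loc=100.0, scale=20.0, size=(16, 16, 16))
normalized = zscore_normalize(toy_scan)
```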

Experimental settings.
To validate the proposed models, we conducted five-fold cross-validation on AD diagnosis (AD vs. NC) and MCI conversion prediction (pMCI vs. sMCI). To compare our proposed model with the comparative methods on both tasks, we first trained the model for AD diagnosis and transferred the trained parameters to initialize the network for the MCI conversion prediction task. We performed this transfer learning process because the two tasks are highly correlated and the MCI conversion prediction task is the more challenging of the two. The classification performance acquired in five-fold cross-validation was measured in terms of accuracy (ACC) and the area under the receiver operating characteristic curve (AUROC).
The proposed method required a predetermined 3D Cartesian space to represent the brain regions where local features were extracted. Therefore, we defined the 3D Cartesian space I with a size of 193 × 229 × 193 × 3, matching the preprocessed 3D brain scans, as illustrated in Fig. 2a. To alleviate the overfitting problem, we adopted a random cropping strategy for data augmentation with a crop size of 177 × 213 × 177 in training. For evaluation, we employed center-cropped images, not only in our proposed method but also in the comparison methods.
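The two cropping modes can be sketched as follows. This is a minimal NumPy illustration with hypothetical function names, not the authors' data loader. The random crop also returns its start offsets so that a caller could crop a coordinate grid with the same offsets; whether and how the coordinate space is cropped alongside the image is an assumption of ours.

```python
import numpy as np

FULL_SHAPE = (193, 229, 193)   # preprocessed scan size from the text
CROP_SHAPE = (177, 213, 177)   # crop size from the text

def random_crop(volume, crop_shape, rng):
    """Training-time crop at a uniformly sampled offset along each axis."""
    starts = tuple(
        int(rng.integers(0, full - crop + 1))
        for full, crop in zip(volume.shape, crop_shape)
    )
    slices = tuple(slice(s, s + c) for s, c in zip(starts, crop_shape))
    return volume[slices], starts

def center_crop(volume, crop_shape):
    """Evaluation-time crop centered in the volume."""
    starts = tuple(
        (full - crop) // 2 for full, crop in zip(volume.shape, crop_shape)
    )
    slices = tuple(slice(s, s + c) for s, c in zip(starts, crop_shape))
    return volume[slices]

rng = np.random.default_rng(0)
scan = np.zeros(FULL_SHAPE, dtype=np.float32)   # placeholder volume
train_crop, offsets = random_crop(scan, CROP_SHAPE, rng)
eval_crop = center_crop(scan, CROP_SHAPE)
```

With these sizes, each axis leaves 16 voxels of slack, so random offsets range from 0 to 16 and the center crop starts at offset 8 on every axis.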
Our encoder network consists of a convolutional block with 32 output feature maps and four residual blocks with 32, 64, 128, and 256 output feature maps, respectively. In the position-based gating branch, the numbers of output feature maps were set to 128 and 256 in the positional embedding network and were reduced to 128 and 16 in the gate network. All weights of the AD diagnosis model were initialized using the He initialization method 54 and optimized using the Adam optimizer 55. We scheduled the learning rate using cosine annealing with learning rate warm-up, following 56. Specifically, the learning rate was linearly increased from 0 to 1e-4 within five epochs and then decreased along a cosine function to zero. We set the total number of epochs to 200, applied early stopping with a patience of 30, and used a mini-batch size of four. Using a grid hyperparameter search, we set the weight between the classification and entropy losses to 0.01.
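The learning rate schedule described above can be expressed as a short function. This is a sketch under assumptions of ours: the schedule is evaluated per epoch (the text does not say whether updates happen per epoch or per iteration), and the cosine decay spans the epochs remaining after warm-up. The function name and argument names are hypothetical.

```python
import math

def learning_rate(epoch: float,
                  peak_lr: float = 1e-4,
                  warmup_epochs: int = 5,
                  total_epochs: int = 200) -> float:
    """Linear warm-up from 0 to peak_lr, then cosine decay to zero."""
    if epoch < warmup_epochs:
        # Linear increase during warm-up.
        return peak_lr * epoch / warmup_epochs
    # Fraction of the post-warm-up period that has elapsed, in [0, 1].
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

schedule = [learning_rate(e) for e in range(201)]
```

The schedule starts at zero, reaches its peak of 1e-4 at epoch 5, and returns to zero at epoch 200; early stopping may end training before the decay completes.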

Comparative methods.
For all experiments, we adopted four sMRI-based deep learning architectures, which are considered state-of-the-art methods: 3D CNN, Attention-based 3D ResNet (A3D-Net), DA-MIDL, and HybNet.
• 3D CNN 13: The CNN-based classifier was trained end-to-end to classify disease states without anatomical prior knowledge or a localization method. For a fair comparison, we adopted the architecture proposed in 13 without clinical information, such as patients' age. Given a randomly cropped 3D image as input, sequential convolutional blocks extracted local features, and the output feature maps were flattened. The flattened vector was used as input to the classifier. We compared our method to the models trained with a widening factor (WF) of 1 and 2.
• Attention-based 3D ResNet (A3D-Net) 21: This method shares our goal of jointly learning AD-related brain-region detection and disease identification in an end-to-end manner. However, there are two underlying differences. One is that the AD-related brain regions were detected based on local features. The other is the non-linear interactions between weighted local features for image-level decision-making. An attention module generated spatial attention weights based on local features extracted in the middle of the network, and the attention was applied to the local features.
• DA-MIDL 52: This method is a dual attention network, which represents both the spatial importance for extracting discriminative features within each sMRI patch and the attention for MIL pooling. The patches were predetermined by group comparison on the training set. Here, we set the number and size of patches to 60 and 25 × 25 × 25, respectively. For feature representation, we applied two types of attention: spatial attention within each patch and attention for MIL pooling. An attention-aware global classifier then processes the bag-level representations for the final diagnosis.
• Hybrid network (HybNet) 22: This method consists of two branches constructed to capture 1) global structural information and 2) local structural information. First, the FCN backbone was trained to generate the DAM and mean DAM to train a GB and LB, respectively. The DAM was directly used as the attention in training the GB, whereas the mean DAM was used to determine the brain regions from which to extract patches. The LB was trained based on the patches extracted from the predetermined brain regions. As described in the literature, we performed the pruning and fine-tuning steps. Finally, the two discriminative feature vectors obtained using the GB and LB were concatenated and used as input for training the fusion branch. The fusion branch comprised two subsequent fully connected layers followed by ReLU activation.
Performance comparison. Table 2 presents the comparison of classification performance for the AD diagnosis and MCI conversion prediction tasks. In the table, BrainBagNet-s processes a single patch size s, and patch-level responses are aggregated using the GAP operation. The FG-BrainBagNet indicates a BrainBagNet with a feature-based gate inspired by an attention-based MIL framework 38. Instead of using position information, local features generated by the encoder network were used as input to the gate network. Lastly, the PG-BrainBagNet is the proposed position-based gate method. In the AD diagnostic task, we first selected the patch size as a tunable hyperparameter using the mean balanced cross-entropy loss on the validation set in five-fold cross-validation. The best classification model in the AD diagnostic task was PG-BrainBagNet-41, selected based on the minimum validation loss. Our proposed method outperformed the state-of-the-art methods in terms of accuracy (ACC) and AUROC. The highest and lowest margins for the mean accuracy were 14.77% (vs. DA-MIDL) and 3.65% (vs. HybNet (GB)), respectively. The proposed position-based gate method (i.e., PG-BrainBagNet) yields an increase in classification performance compared to BrainBagNets regardless of patch size. Moreover, PG-BrainBagNet significantly improved the classification performance of BrainBagNet (i.e., by 13.13%) with a small patch size (i.e., s = 9), whereas feature-based gates (i.e., FG-BrainBagNet) even showed low performance when s = 57. Although the classification performance of BrainBagNets increased as the patch size increased, PG-BrainBagNets did not exhibit significant differences according to patch size. Therefore, the reason that BrainBagNets performed classification poorly when using a small patch size might be that the whole-brain image contains many patches unrelated to the brain disease. In addition, the improvement in classification performance suggests that our proposed method (w/ gating mechanism) captures the AD-related regions effectively. To validate the generalization of the models trained on the ADNI dataset, we additionally tested them on the AIBL cohort. We observed that the position-based gating branch (i.e., PG-BrainBagNet) improved the accuracy and AUROC score. In addition, our proposed method showed a balanced prediction result even in the case of class imbalance. Although HybNet achieved the highest accuracy, our proposed model obtained more balanced performance, outperforming the state-of-the-art models in terms of AUROC.
In the MCI conversion prediction task, the best model among our proposed methods was PG-BrainBagNet-17, again selected based on the minimum validation loss. In the comparison with the state-of-the-art methods, the classification accuracy did not show a significant difference because of the class-imbalanced dataset, which consists of 251 pMCI and 497 sMCI samples. On the other hand, our proposed method significantly increased the AUROC score over the comparison methods. Similar to the results of the AD diagnosis task, the classification performance of BrainBagNets increased as the patch size increased, excluding s = 57. Furthermore, the feature-based gating method (i.e., FG-BrainBagNet) did not improve the classification results. However, the position-based gate method (i.e., PG-BrainBagNet) yielded improvements when a small patch size was used, especially when the patch size was 9, 17, 25, or 41. Except for the s = 57 case, the classification results of PG-BrainBagNet consistently increased as the patch size was reduced. As the receptive field size was limited, local feature representations were forced to capture local brain changes rather than global structural changes. The results implied that the position-based brain-region localization method provided highly informative results. In addition, we observed the importance of capturing subtle changes for the early detection of AD.
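The effect of the class imbalance on accuracy can be made concrete with the sample counts given above. The short calculation below is our own illustration, not a result from the paper: it evaluates a trivial classifier that always predicts the majority class (sMCI).

```python
# Sample counts reported in the text.
n_pmci, n_smci = 251, 497
total = n_pmci + n_smci

# A classifier that labels every scan as sMCI (the majority class).
majority_accuracy = n_smci / total

# The same classifier detects no pMCI case and every sMCI case.
sensitivity = 0 / n_pmci
specificity = n_smci / n_smci
balanced_accuracy = 0.5 * (sensitivity + specificity)
```

Such a classifier is correct on 497 of 748 scans (about 66.4%) while its balanced accuracy is only 0.5, which illustrates why a threshold-independent measure such as AUROC is more informative than accuracy on this task.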
The MCI conversion prediction performance obtained from the proposed models trained from scratch and those trained through transfer learning is compared in supplementary A. Furthermore, Fig. 3 illustrates the confusion matrices of the PG-BrainBagNet-41 and PG-BrainBagNet-17 models, which exhibited the best performance according to the validation loss criterion, for the tasks of AD diagnosis and MCI conversion prediction, respectively. Each confusion matrix showcases the performance of the corresponding model on its respective task and dataset.
Visualization of discriminative brain regions. We analyzed the discriminative probability maps G generated by the proposed position-based gating branch. For better visualization, the maps were linearly interpolated and overlaid on the MNI template. The changes in the discriminative probability map over the learning epochs are described in supplementary B. In the learning process with a small patch size, localization helped the diagnostic model extract finer features; thus, the gating branch could provide better localization results based on the features extracted by the diagnostic model. The differences in the output of the gating branch of the trained model according to patch size and downstream task are compared in Fig. 4. First, high responses were distributed in anatomically meaningful areas such as the hippocampal, temporal, and parietal lobe areas. When the receptive field size of the model was restricted, the gate network assigned weight to sparse regions only. As the patch size increased, regions with high responses were captured across the whole image. In this context, the proposed method employing small patches is sensitive to localization results and requires proper localization. While there were no significant changes between downstream tasks, we observed that the highlighted regions were more dispersed, especially in the model trained using small patches for MCI conversion prediction.
Effect of brain region localization on diagnostic performance. We performed an ablation study of localization methods using our proposed framework to evaluate the effectiveness of joint learning of discriminative brain-region localization and disease identification. We compared four localization methods: "w/o", "mean DAM", "mean DAM (0.3)", and "end-to-end". The "w/o" denotes BrainBagNets, and the "end-to-end" denotes the proposed models, which are PG-BrainBagNets. Both "mean DAM" and "mean DAM (0.3)" were models trained with a predetermined discriminative brain region inspired by the mean DAM introduced in 22. For "mean DAM", the proposed framework was trained with a predetermined G, using the mean DAM as G instead of training the position embedding and gate networks. In addition, we generated a binary mask because the mean DAM was not used as a probabilistic value in 22 but instead was used for extracting patches. In patch extraction, a threshold of 0.3 was used in the literature to represent potential patch locations. We obtained a binary mask based on this threshold, and model training was performed in the same way as for "mean DAM"; the resulting model is denoted as "mean DAM (0.3)". The predetermined G for "mean DAM" and "mean DAM (0.3)" are illustrated in Fig. 5a. The average accuracy (ACC) and AUROC in five-fold cross-validation are shown in Fig. 5b and c. First, when the model was trained without brain-region localization, classification performance decreased as the patch size was reduced. The model trained using the smallest patches exhibited the lowest classification performance for both tasks in accuracy and AUROC. Adding the predetermined localization method improved the classification performance compared with that without the localization method. However, this localization was performed independently of the diagnosis model, resulting in worse classification than the proposed method (i.e., "end-to-end"). In the MCI conversion prediction task, only the proposed method demonstrated increased classification performance when the patch size was restricted. This result implies that restricting the patch size allows the extraction of AD-related local and subtle changes but requires suitable brain-region localization that depends on the diagnosis model.
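The two predetermined gates used in the ablation can be sketched in a few lines. This is an illustrative example under assumptions of ours, not the procedure of 22: the function names are hypothetical, each DAM is assumed to be already scaled to [0, 1], and the threshold comparison is taken as inclusive.

```python
import numpy as np

def compute_mean_dam(dams: np.ndarray) -> np.ndarray:
    """Average subject-specific DAMs of shape (N, D, H, W) over subjects."""
    return dams.mean(axis=0)

def binarize_dam(mean_dam: np.ndarray, threshold: float = 0.3) -> np.ndarray:
    """Binary mask of candidate locations; `>=` is an assumption."""
    return (mean_dam >= threshold).astype(np.float32)

# Two toy DAMs on a 1 x 2 x 2 grid, values assumed to lie in [0, 1].
toy_dams = np.array([
    [[[0.0, 0.2], [0.4, 1.0]]],
    [[[0.2, 0.2], [0.4, 0.8]]],
])
gate_soft = compute_mean_dam(toy_dams)   # plays the role of G for "mean DAM"
gate_hard = binarize_dam(gate_soft)      # plays the role of G for "mean DAM (0.3)"
```

In both variants the gate is fixed before training, so it cannot adapt to the features the diagnostic model learns, which is the limitation the end-to-end variant is designed to remove.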
Effectiveness of subtle changes captured using small patches. The aforementioned, we demonstrated that the model extracting the local class evidence captured using a small receptive field size better predicts MCI conversion. We analyzed the two pMCI samples that yielded false negatives from the model trained www.nature.com/scientificreports/ using a large patch and made predictions correctly by limiting the patch size. The local class evidence according to patch size is described in Fig. 6. The first row depicted the original sMRI scans with image ID 26442 and 31799. The following rows demonstrated the patch-level class evidence according to the patch size. For a better comparison, the areas where a high amount of class evidence was contained were marked with dashed rectangles and denoted as R1 to R4. The blue and red colors indicate high-class evidence for sMCI and pMCI classes in that region, respectively. First, in the bottom of the R1 region, we observed that Sample #26442 contains brain atrophy in the temporal lobe rather than the hippocampus, compared to Sample #31799. In contrast, Sample #31799 depicts brain atrophies located in the hippocampal area. These brain atrophies were correctly captured by the model trained using small patches. However, the estimation produced by the model trained using larger patches demonstrated the difficulties of capturing these subtle changes. These patterns can be observed in the R1, R2, and R3 regions. The positive class evidence for the pMCI class located in the parietal lobe area was only captured by the model trained using small patches, which can be observed in comparing the R4 region. Finally, the model trained using small patches could determine sufficient local evidence to correctly predict the MCI conversion, whereas models trained using large patches could not capture sufficient cues for a correct decision. 
In this analysis, we observed that limiting the patch size increased the prediction performance for MCI conversion by extracting subtle and local structural feature representations. To better understand these results, we provide additional visualizations of patch-level class evidence for various patch sizes and samples in Supplementary C.

Discussion
From a practical perspective, there are several pros and cons to consider in our study. On the positive side, our proposed framework deviates from previous patch-level approaches by leveraging a CNN-based encoder to extract patch-level features directly from whole-brain images, leading to improved computational efficiency. This efficiency is beneficial for practical implementation and real-time applications. However, there are some limitations to acknowledge. First, our study primarily focused on the early-stage diagnosis of MCI and does not address the forecasting of disease progression. Future research is needed to extend the framework's capabilities to predict the progression of brain diseases. Moreover, while our proposed framework provides explainability by visualizing a probability map to aid clinicians and patients in understanding the model's decisions, estimating the uncertainty of those decisions remains an important consideration that requires further exploration. Incorporating uncertainty estimation into the framework would enhance its reliability and trustworthiness. Lastly, given the recent advancements in multi-modal learning, it would be worthwhile to explore the integration of other neuroimaging modalities, such as PET and diffusion-weighted imaging, to improve the overall performance and diagnostic accuracy of our current framework.

Conclusion
In this work, we proposed a deep learning training pipeline for patch-level feature representation learning on MRI scans. To alleviate the problem caused by predetermined brain regions for patch-level feature representation learning, we proposed the PG-BrainBagNet framework for jointly learning discriminative brain-region localization and disease identification in an end-to-end manner. We conducted both the AD diagnosis and MCI conversion prediction tasks on two publicly available datasets, thereby demonstrating the validity of our proposed method. Specifically, our PG-BrainBagNet obtained the best classification performance among the competing methods. Furthermore, our PG-BrainBagNet effectively increased the classification performance when localization of subtle changes was required. We also demonstrated the interpretability of the proposed method by tracing the rationale for the model predictions down to the small patch level.

Figure 6. Examples of false negatives from the model trained with large patches but correctly predicted by the model trained using small patches. Each column indicates one sample labeled as progressive mild cognitive impairment (pMCI), where the number next to the # denotes the corresponding image ID of an input MRI scan. In addition, the blue and red colors indicate high class evidence for the stable mild cognitive impairment and pMCI class in that region, respectively.

Methods
We propose PG-BrainBagNet, as depicted in Fig. 1, which consists of four networks: the encoder, the classifier, the position embedding network, and the gate network. These networks are organized into two branches, patch-level prediction and position-based gating, and each branch takes a different input: the input MRI scan $X$ and the position indicator $I'$, respectively. The input MRI scan $X$ can be considered a set containing $M$ patches, $X = \{x_1, \cdots, x_M\}$, and the position indicator $I'$ represents the patch position information. In the following sections, we present the details of how the two branches jointly learn AD-related local morphological changes and the regions where the discriminative changes consistently appear.

Patch-level feature extraction and classification.
The patch-level prediction branch comprises an encoder $E^{s}_{\phi}$ and a classifier network $C_{\psi}$, parameterized by $\phi$ and $\psi$, respectively. First, we construct an encoder network whose patch size for feature extraction is determined by the receptive field size of its top-level feature maps, so that it can handle local morphological changes distributed over the whole brain. The configured encoder takes the whole-brain image $X \in \mathbb{R}^{W \times H \times D \times 1}$ as input and extracts local features from 3D patches of size $s \times s \times s$, where $W$, $H$, and $D$ denote the image width, height, and depth, respectively. The patch-level features extracted from the whole brain are represented as feature maps $\tilde{X} \in \mathbb{R}^{w \times h \times d \times f}$, where $w$, $h$, and $d$ denote the sizes of the 3D spatial dimensions and $f$ is the size of the feature space. In particular, the kernel and stride sizes of the convolution operators determine the patch size and the distance between patches. This approach was introduced in BagNets 18, and we adopted it to construct our network architectures; specifically, based on BagNets, we rebuilt shallower encoders using 3D convolutional layers.

The goal of configuring an encoder network $E^{s}_{\phi}$ is to represent local features extracted from patches of size $s \times s \times s$. The encoder network $E^{s}_{\phi}$ consists of a convolutional block, a max-pooling layer, and four residual blocks, as illustrated in Fig. 1. All convolutional blocks in this study comprise a convolutional layer, an instance normalization layer, and a rectified linear unit (ReLU) activation function, applied sequentially. The kernel and stride sizes of the first convolutional block are set to $5 \times 5 \times 5$ and $2 \times 2 \times 2$, respectively. For the following max-pooling layer, the kernel size is $3 \times 3 \times 3$ and the stride size is $2 \times 2 \times 2$. The feature maps yielded by max pooling therefore have a receptive field size of $9 \times 9 \times 9$. Thus, if the receptive field size does not increase further, the encoder extracts features with a receptive field size of $9 \times 9 \times 9$ from the whole brain. With nine as the minimum size, we constructed encoders for five receptive field sizes by configuring the subsequent residual blocks.
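The receptive field size quoted above follows from standard receptive-field arithmetic, which can be checked with a short sketch. The helper name `receptive_field` and the (kernel, stride) list are illustrative assumptions rather than part of the original implementation, and the extra stride-1 layer in the second call is hypothetical; it only shows how a later layer would enlarge the patch size.

```python
def receptive_field(layers):
    """Return (receptive field size, output spacing) along one axis.

    `layers` is a list of (kernel_size, stride) pairs applied in order.
    Both returned values are measured in input voxels.
    """
    rf, jump = 1, 1  # a single voxel sees itself; adjacent outputs are 1 voxel apart
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (k - 1) * current spacing
        jump *= stride             # striding spreads neighbouring outputs further apart
    return rf, jump


# First convolutional block (kernel 5, stride 2) followed by max pooling (kernel 3, stride 2).
stem = [(5, 2), (3, 2)]
print(receptive_field(stem))             # (9, 4)

# Hypothetical extra 3 x 3 x 3, stride-1 convolution appended after the stem.
print(receptive_field(stem + [(3, 1)]))  # (17, 4)
```

Under these assumptions, each output position of the stem corresponds to a $9 \times 9 \times 9$ patch, and neighbouring patches are four voxels apart along each axis.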
Overall, we obtain the patch-level feature representation $\tilde{X} \in \mathbb{R}^{w \times h \times d \times f}$ for an individual MRI scan $X$. Then, the classifier network $C_{\psi}$ converts each $f$-dimensional vector into a scalar patch-level response and produces $\hat{X} \in \mathbb{R}^{w \times h \times d \times 1}$. The patch-level responses $\hat{X} = (\hat{x}_{1,1,1}, \cdots, \hat{x}_{i,j,k}, \cdots, \hat{x}_{w,h,d})$ comprise local responses $\hat{x}_{i,j,k} \in \mathbb{R}$. Because both the encoder and classifier networks extract local responses with a specific receptive field size and share the same extraction function over all spatial positions, each local response is extracted from a 3D patch without considering its position in the brain.
Position-based gate for AD-related brain-region localization. This branch represents the probability of detecting AD-related morphological changes in patches centered on specific coordinates in MR scans. As all MRI scans were aligned to a 3D template in the image processing step, a single 3D space can be shared across all samples. In addition, all 3D patches are single-scale patches with cubic shapes. Thus, the patches distributed over the entire brain can be differentiated only by their position, and the center of each patch serves as its representative position. The simplest method to indicate positions is to use a one-hot representation. However, this approach can be inefficient because of the large number of patches, and it ignores the volumetric position in 3D space. This problem can be efficiently addressed using the Cartesian coordinate system.
Inspired by the representation proposed in 57, we constructed a 3D coordinate representation $I$ that specifies a 3D Cartesian space. The coordinates of the 3D Cartesian space are represented in three channels, one per axis. Based on this position information, we obtain the center position of each patch used in the patch-level prediction branch. The center position information is represented as a position indicator $I'$. The extraction of the position indicator is described in Fig. 2b. First, when data augmentation such as image translation and cropping is used, the representation of the 3D Cartesian space $I$ must be transformed in the same way as the input, so that it has the same spatial dimensions as the input MRI scan. Then, following the structure of the encoder network, the center positions of the receptive fields are hierarchically extracted. The extracted position indicator $I' \in \mathbb{R}^{w \times h \times d \times 3}$ is taken as input for the position-based gating branch.
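A minimal sketch of the position indicator may help fix ideas. It assumes coordinates normalized to $[-1, 1]$ over the template, unpadded layers, and the stem receptive field of 9 with a spacing of 4 voxels. All function names are hypothetical, and the patch centers are computed directly here rather than hierarchically through the encoder.

```python
from itertools import product


def axis_coordinates(template_size, crop_start, crop_size):
    """Normalized template coordinates of the voxels kept along one axis.

    The full template axis maps to [-1, 1]; `crop_start` models a crop or
    translation applied to the input scan (an assumption of this sketch).
    """
    return [2.0 * (crop_start + v) / (template_size - 1) - 1.0 for v in range(crop_size)]


def patch_centers(coords, rf, jump):
    """Coordinate at the center of every receptive field (no padding, odd rf)."""
    n_out = (len(coords) - rf) // jump + 1
    half = (rf - 1) // 2
    return [coords[i * jump + half] for i in range(n_out)]


def position_indicator(template_shape, crop_start, crop_shape, rf=9, jump=4):
    """Map each patch index (i, j, k) to its center coordinate (x, y, z)."""
    xs, ys, zs = (
        patch_centers(axis_coordinates(t, s, c), rf, jump)
        for t, s, c in zip(template_shape, crop_start, crop_shape)
    )
    return {
        (i, j, k): (xs[i], ys[j], zs[k])
        for i, j, k in product(range(len(xs)), range(len(ys)), range(len(zs)))
    }


full = position_indicator((17, 17, 17), (0, 0, 0), (17, 17, 17))
shifted = position_indicator((17, 17, 17), (4, 0, 0), (13, 17, 17))
print(full[(0, 0, 0)])     # (-0.5, -0.5, -0.5)
print(shifted[(0, 0, 0)])  # (0.0, -0.5, -0.5)
```

Because the coordinates are cropped together with the image, the first patch of the shifted scan reports the template position it actually covers, which is what lets the gate refer to absolute brain locations under augmentation.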
The position-based gating branch generates translation-dependent outcomes, which is not possible in the patch-level prediction branch. As described in Fig. 1, the position embedding network $P_{\pi}$ and the gate network $G_{\rho}$ consist of convolutional layers, and all convolutional layers in both networks are point-wise convolutions, parameterized by $\pi$ and $\rho$, respectively. In the position embedding network, the semantic feature representation $\hat{I}$ is extracted to detect the task-oriented discriminative regions by increasing the number of feature maps. The gate network then decreases the number of output feature maps to encode the semantic feature representation. Finally, the remaining feature maps are averaged and passed through the sigmoid activation function to generate the discriminative probability map $G = (g_{1,1,1}, \cdots, g_{i,j,k}, \cdots, g_{w,h,d})$. The discriminative probability map consists of $g_{i,j,k} \in [0, 1]$, representing the position-based response located at $(i, j, k)$. Because the position indicator $I'$ is constructed from coordinates in the 3D Cartesian space, absolute positioning can be performed for each patch and shared over the MRI scans. Therefore, the trained position embedding and gate networks produce a high response in the regions where AD-related morphological changes are consistently captured.
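The gating branch can be sketched as a small per-position network, since a point-wise convolution is simply the same linear map applied independently at every location. The channel widths (3 to 8 to 16, then 16 to 4), the ReLU between embedding layers, the random initialization, and all function names below are assumptions made for illustration only; the source does not specify them.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Hypothetical random initialization of one point-wise (1 x 1 x 1) convolution."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def pointwise(vec, layer):
    """Apply a point-wise convolution at a single spatial position."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, biases)]


def relu(vec):
    return [max(0.0, v) for v in vec]


def gate_value(position, embedding_layers, gate_layers):
    """Map one patch-center coordinate (x, y, z) to a gate value in (0, 1)."""
    h = list(position)
    for layer in embedding_layers:  # position embedding: increase the number of channels
        h = relu(pointwise(h, layer))
    for layer in gate_layers:       # gate network: decrease the number of channels
        h = pointwise(h, layer)
    mean = sum(h) / len(h)          # average the remaining feature maps
    return 1.0 / (1.0 + math.exp(-mean))  # sigmoid activation


rng = random.Random(0)
embedding_layers = [init_layer(3, 8, rng), init_layer(8, 16, rng)]
gate_layers = [init_layer(16, 4, rng)]

g_center = gate_value((0.0, 0.0, 0.0), embedding_layers, gate_layers)
g_other = gate_value((-0.5, 0.25, 0.5), embedding_layers, gate_layers)
print(g_center)  # 0.5, because all biases are zero at the origin
```

Note that `gate_value` receives only a coordinate and never the scan itself, so the resulting map depends purely on position and is identical for every subject, consistent with the description above.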
Gate-based pooling for image-level prediction. By considering a 3D whole brain to be a bag and the local features extracted from 3D patches distributed over the whole brain to be instances, the proposed framework can be regarded as a MIL framework. In conventional MIL-based classification problems, permutation-invariant pooling operators (e.g., max and mean) have been widely used to aggregate instance-level representations into a bag-level representation. Just as 18 introduced GAP for aggregating patch-level responses, the mean operator has been used as a representative aggregation function, especially when more than one instance is needed to identify a bag. The mean operation directly calculates the image-level response, $z$, as follows:

$$z = \frac{1}{whd} \sum_{i,j,k} \hat{x}_{i,j,k}.$$

A function parameterized by neural networks was proposed in 38 to detect key instances and aggregate the responses based on them. Inspired by this aggregation method, we defined the image-level response by aggregating patch-level responses through the position-based outcomes of the gate network. The element-wise multiplication between the patch-level responses $\hat{X}$ and the discriminative probability map $G$ results in the patch-level class evidence $E \in \mathbb{R}^{w \times h \times d \times 1}$. The total amount of the discriminative brain region is unknown; thus, normalization is performed based on the sum of the discriminative probability map so that the amount of the gated regions is independent of the diagnostic results. The aggregation of patch-level class evidence infers image-level abnormality and is defined as follows:

$$z = \frac{\sum_{i,j,k} e_{i,j,k}}{\sum_{i,j,k} g_{i,j,k}},$$

where $e_{i,j,k} = g_{i,j,k}\,\hat{x}_{i,j,k}$. The image-level response $z$ is converted directly into the posterior probability $\hat{y} = p(y|X)$ using the sigmoid activation function. The patch-level class evidence $E$ directly reveals which patches made a significant contribution to the final decision, making the model transparent and interpretable.
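The difference between mean pooling and gate-based pooling can be illustrated with a toy example. The sketch flattens the $w \times h \times d$ grid into a plain list, and the function names and numeric values are made up for illustration; it is not the authors' implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mean_pooling(responses):
    """Image-level response as the plain average of patch-level responses."""
    return sum(responses) / len(responses)


def gate_based_pooling(responses, gates):
    """Image-level response normalized by the total gate mass.

    Returns (z, evidence), where evidence[n] = gates[n] * responses[n].
    Assumes at least one gate is non-zero.
    """
    evidence = [g * x for g, x in zip(gates, responses)]
    z = sum(evidence) / sum(gates)
    return z, evidence


# Four patches: two in a (hypothetically) discriminative region, two elsewhere.
responses = [2.0, -1.0, 0.5, -3.0]
gates = [1.0, 0.0, 0.5, 0.0]

z_mean = mean_pooling(responses)
z_gate, evidence = gate_based_pooling(responses, gates)
print(z_mean)  # -0.375
print(z_gate)  # 1.5
print(sigmoid(z_gate) > 0.5, sigmoid(z_mean) > 0.5)  # True False
```

In this toy case the gated aggregate is driven only by the patches the gate keeps open, so the prediction flips relative to mean pooling, and dividing by the gate sum keeps $z$ on the same scale regardless of how many patches are selected.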
Joint learning of pathological brain-region localization and disease identification. All parameters (i.e., $\phi$, $\psi$, $\pi$, and $\rho$) are trained based on the image-level classification objective. To improve generalization, the proposed models were trained using two additional techniques, label smoothing and balanced cross-entropy, following prior studies 56,58,59. The classification loss is a balanced cross-entropy computed against the modified target $y_{LS} \in \{0.1, 0.9\}$, with a hyperparameter $\beta$ that addresses the imbalanced classification problem. $\beta$ was set to the inverse class frequency; precisely, it was calculated as the number of samples with negative annotation ($y = 0$) divided by the total number of samples. The gradient generated by the classification loss updates the parameters $\phi$, $\psi$, $\pi$, and $\rho$. Moreover, the element-wise multiplication between $G$ and $\hat{X}$ makes the forward and backward propagation of the two branches highly dependent on each other. However, in the early stages of training, both $\hat{X}$ and $G$ are produced by randomly initialized parameters. To encourage the framework to explore more discriminative brain regions, we employ an entropy loss that maximizes the entropy of $G$; it accumulates the term $g_{i,j,k} \log g_{i,j,k} + (1 - g_{i,j,k}) \log(1 - g_{i,j,k})$ over all positions $(i, j, k)$. The gradient generated by the entropy loss affects only the parameters $\pi$ and $\rho$. The total loss combines the classification loss and the entropy loss, with a hyperparameter weighting the two terms, and the proposed network is trained in an end-to-end manner with this total loss.
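As a rough illustration of the training objective, the sketch below combines a label-smoothed, class-balanced cross-entropy with the entropy term on the gate map. The exact placement of $\beta$ in the cross-entropy, the unnormalized sum in the entropy term, the weighting symbol `lam`, and all function names are assumptions of this sketch; the source does not give them in a recoverable form.

```python
import math


def inverse_class_frequency(labels):
    """beta: number of negative samples (y = 0) divided by the total number of samples."""
    return sum(1 for y in labels if y == 0) / len(labels)


def smooth_label(y, eps=0.1):
    """Label smoothing: 1 -> 0.9 and 0 -> 0.1."""
    return 1.0 - eps if y == 1 else eps


def balanced_bce(y_hat, y, beta):
    """Assumed form: beta weights the positive term, (1 - beta) the negative term."""
    y_ls = smooth_label(y)
    return -(beta * y_ls * math.log(y_hat)
             + (1.0 - beta) * (1.0 - y_ls) * math.log(1.0 - y_hat))


def entropy_loss(gates, tiny=1e-12):
    """Negative binary entropy of the gate map; minimizing it maximizes entropy."""
    total = 0.0
    for g in gates:
        g = min(max(g, tiny), 1.0 - tiny)  # guard against log(0)
        total += g * math.log(g) + (1.0 - g) * math.log(1.0 - g)
    return total


def total_loss(y_hat, y, gates, beta, lam):
    """Weighted combination; `lam` is a hypothetical name for the weighting hyperparameter."""
    return balanced_bce(y_hat, y, beta) + lam * entropy_loss(gates)


beta = inverse_class_frequency([0, 0, 0, 1])
print(beta)  # 0.75
print(entropy_loss([0.5, 0.5]) < entropy_loss([0.99, 0.01]))  # True
```

Under these assumptions, a gate map that stays near 0.5 everywhere attains the lowest entropy loss, which discourages the gate from collapsing onto a few regions before the patch-level responses become informative.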

Data availability
We have evaluated our proposed method on the Alzheimer's Disease Neuroimaging Initiative (ADNI) and the Australian Imaging, Biomarker & Lifestyle Flagship Study of Ageing (AIBL) datasets. Both datasets are publicly available, and more information can be found at the following links: (ADNI) https://adni.loni.usc.edu/data-samples/access-data/, (AIBL) https://aibl.csiro.au/.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.