Multimodal multitask learning for predicting MCI to AD conversion using stacked polynomial attention network and adaptive exponential decay

Early identification and treatment of mild cognitive impairment (MCI) can halt or postpone Alzheimer’s disease (AD) and preserve brain function. For prompt diagnosis and AD reversal, precise prediction in the early and late phases of MCI is essential. This research investigates multimodal framework-based multitask learning in the following situations: (1) differentiating early mild cognitive impairment (eMCI) from late MCI (lMCI) and (2) predicting when an MCI patient would develop AD. Clinical data and two radiomics features on three brain areas derived from magnetic resonance imaging (MRI) were investigated. We proposed an attention-based module, Stacked Polynomial Attention Network (SPAN), to robustly encode clinical and radiomics data input characteristics for successful representation from a small dataset. To improve multimodal data learning, we computed a potent factor using adaptive exponential decay (AED). We conducted experiments on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) cohort study, which included 249 eMCI and 427 lMCI participants at baseline visits. The proposed multimodal strategy yielded the best c-index score in time prediction of MCI to AD conversion (0.85) and the best accuracy in MCI-stage categorization (83.19%). Moreover, our performance was equivalent to that of contemporary research.

www.nature.com/scientificreports/ early diagnosis of AD, with consistently encouraging findings 12 . In recent decades, several studies have been proposed for automatic detection of AD 6,[13][14][15][16] . Various neuroimaging signals such as magnetic resonance imaging (MRI) 17,18 , functional magnetic resonance imaging (fMRI) 19,20 , positron emission tomography (PET) 21,22 , electroencephalography (EEG) [23][24][25] , and magnetoencephalography (MEG) 26,27 have been investigated to determine if there are any anomalous clustering coefficients or distinctive path lengths in the brain networks of AD patients. The ability to diagnose and categorize MCI at an early stage helps physicians to make better informed judgments about clinical intervention and treatment planning at a later stage, which has a significant influence on cost-effectiveness of long-term care services 28 . However, only a few studies on the features of brain networks in MCI patients have explored the properties of brain networks at different phases since brain abnormalities are so subtle 29,30 . Feature fusion strategies have gained significant attention in the medical field for their ability to integrate diverse information sources and enhance diagnostic accuracy. Multimodal image fusion techniques, as highlighted by Wang et al. 31 , enable the combination of complementary information from different imaging modalities, such as MRI, CT, and PET, to improve disease interpretation. Deep learning-based approaches, exemplified by Li et al. 32 , leverage feature fusion to enhance medical diagnosis by effectively integrating multimodal information. Moreover, Tong et al. 33 demonstrated the potential of feature fusion in Alzheimer's disease diagnosis, utilizing hybrid weighted multiple kernel learning to integrate clinical assessments, genetic profiles, and neuroimaging data. 
By leveraging feature fusion strategies, medical researchers and practitioners can harness the power of multiple data modalities to improve disease detection, localization, and overall patient outcomes. In this paper, we will present a straightforward and efficient fusion equation designed to combine multimodal data.
In this paper, we proposed a novel attention-based mechanism for multimodal multitask learning of AD progression. We employed MRI scans and clinical data to distinguish eMCI from lMCI while also predicting the time to AD conversion. We extracted three brain regions, namely gray matter (GM), white matter (WM), and cerebrospinal fluid (CSF), from T1-MRI images using the statistical parametric mapping (SPM) toolbox (https://www.fil.ion.ucl.ac.uk/spm/). Then, we estimated the texture and shape features from the masked regions using the PyRadiomics toolbox (https://pyradiomics.readthedocs.io/). Next, we introduced a novel deep learning (DL) approach called stacked polynomial attention network (SPAN) for learning a more accurate approximation basis for all polynomials of bounded degree 34 . Two branches with SPAN and dropout layers are employed to encode the clinical and radiomics representations, and the prominent characteristics of both branches are effectively merged using our proposed formula, adaptive exponential decay (AED). The composite representation is scaled using a series of fully connected layers. Finally, the probability of lMCI and the hazard rate of AD conversion are calculated simultaneously as multitask learning. The main contributions of our study are as follows:
• We proposed a multimodal multitask learning based approach to simultaneously classify eMCI and lMCI stages and predict the time to AD conversion for these MCI patients for early diagnosis of AD. To the best of our knowledge, this is the first study to integrate two tasks: the categorization of the MCI stage and the prediction of the period from the MCI stage to the onset of AD.
• Technically, we proposed a novel attention-based mechanism, SPAN, to learn data representations from finite sample datasets in a practical and effective manner.
• We carried out an exploratory investigation of radiomics characteristics for predicting the course of AD in three brain areas (GM, WM, and CSF).
• We proposed the use of a decay factor, AED, to aid in the acquisition of the dominant representation across modalities.
• We experimented on a public dataset and employed cross-validation to show the generalization of the proposed system. Several aspects of disease analysis were explored to better understand the course of AD.

Study participants.
To evaluate the efficiency of the proposed framework, we used the Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort, which includes diagnoses of 1,737 patients (ages 54.5 to 98.6 years) from 2005 to 2017. In line with our scope, which focuses on the tasks of MCI-stage classification and time-to-AD prediction, we only selected patients who were diagnosed as either eMCI or lMCI at the baseline timepoint. Furthermore, we cleaned up the raw data by removing timepoints that had either been duplicated or had implausible measurements, and we screened out individuals whose condition reverted from AD to MCI or from MCI to cognitively normal (CN) during the course of the study. Based on the given patients' IDs, we manually collected their corresponding MRI scans from the ADNI site. In total, we obtained 249 eMCI and 427 lMCI patients at baseline diagnosis. Table 1 presents the statistics of the collected data from the ADNI cohort for the two groups of eMCI and lMCI patients. There are significant differences between the two groups in terms of age, Clinical Dementia Rating Scale-Sum of Boxes (CDRSB), Mini Mental State Examination (MMSE), Alzheimer's Disease Assessment Scale-13 (ADAS13), Rey Auditory Verbal Learning Test (RAVLT), Functional Activities Questionnaire (FAQ), volumetric and PET biomarkers (p < 0.05).
To determine time-to-AD conversion for uncensored patients, we assume that the conversion time is the time span between the baseline diagnosis and the first observation of AD. For censored patients, the conversion time is calculated by adding the delaying time to their most recent visits. The data distribution and conversion time are visualized in Fig. 1. For the eMCI cases, the censored patients outnumber the uncensored ones. In Fig. 1b, the uncensored patient event occurs when the patient is diagnosed with AD, while the censored patient event occurs at the end of the study, which is the last observation on this patient. The details of data

For quantitative evaluation, we use c-index score (CI), Brier score (BS), and mean absolute error (MAE) as criteria for the time prediction task, and accuracy (Acc), average precision (AP), precision (Pre), recall (Rec), F1-score (F1), and area under the receiver operating characteristic curve (AUC) as criteria for the classification task.

Comparison to conventional studies. In this section, we present a comprehensive analysis of the performance of our proposed model and the existing research in terms of prediction of time-to-AD conversion and MCI-stage classification, as shown in Tables 2 and 3. Table 2 highlights the performance of various methods in predicting time-to-AD conversion, including our proposed approach, along with the modalities, data size, and evaluation metrics employed. The study by Polsterl et al. 35 utilized 3D hippocampus data along with clinical data to predict the conversion time to AD. Their approach achieved a CI of 0.803. Lu et al. 36 focused on MRI and genetic data as their modalities for conversion time prediction. They reported a CI of 0.681, indicating moderate predictive performance. The BS, a measure of calibration, was reported as 0.147, suggesting some room for improvement in calibration. Nakagawa et al. 37 employed gray matter (GM), patient age, and Mini-Mental State Examination (MMSE) scores as input features for their prediction model. Their approach achieved a CI of 0.83 when evaluated on both NC and MCI patients, and a CI of 0.75 when using the MCI set only. Ho et al. 38 investigated the use of demographics and brain biomarkers for conversion time prediction. Their approach achieved a CI of 0.804, similar to the performance reported by Polsterl et al. 35 . The BS value for this study was reported as 0.153, indicating good calibration. In comparison, our proposed method utilized radiomics features extracted from MRI scans along with clinical data for conversion time prediction of MCI patients. Our approach achieved a higher CI of 0.846, indicating improved predictive performance compared to the other methods discussed. Additionally, the BS value for our approach was 0.132, suggesting good calibration and accurate estimation of conversion time.

Table 3 summarizes the Acc and AUC values achieved by various methods for the classification of eMCI and lMCI. Suk et al. 39 utilized MRI and PET modalities for the classification of eMCI and lMCI, achieving an accuracy of 75.92% and an AUC of 0.75. This approach performed reasonably well in distinguishing between the two classes, although the AUC suggests room for improvement in the discriminatory power of the model. Nozadi et al. 40 also employed MRI and PET data for classification, with a dataset of 164 eMCI and 189 lMCI samples. Their approach achieved an accuracy of 65.2%, indicating moderate performance in distinguishing between the two classes. Jie et al. 41 focused on resting-state functional MRI (rs-fMRI) as their modality for classification. Their approach demonstrated a higher accuracy of 78.8% and an AUC of 0.78. These results indicate better discrimination between eMCI and lMCI compared to the previous approaches utilizing MRI and PET. Zhang et al. 42 also utilized rs-fMRI data for classification and achieved an accuracy of 83.87% and an impressive AUC of 0.9, demonstrating superior performance in accurately distinguishing between the two

Performance on prediction of MCI to AD conversion. We individually investigated clinical and radiomics characteristics to determine the effectiveness of our proposed model compared to the unimodal approach.
In the Performance on combination of radiomics features subsection of the Supplementary Materials, we demonstrated that the optimal combination of [CSF][texture] features was utilized as the radiomics input for our proposed model. In comparison to the study conducted by Ho et al. 38 , we replaced the SPAN module with a residual-attention (RA) module for feature encoding. Our SPAN algorithm exhibited superior performance in predicting conversion time-to-AD with clinical features, achieving a higher CI (0.82 compared to 0.8) and lower MAE (454 days compared to 510 days). Additionally, it slightly improved the performance of the MCI-stage classification task. Notably, the SPAN encoder outperformed the RA encoder in both tasks, resulting in a reduction of 56 days in MAE and an improvement of 0.89% in accuracy. Instead of the proposed AED fusion strategy, we utilized concatenation (Concat) for comparison. Experimental results showed that our proposed AED approach outperformed traditional concatenation in representation fusion for both tasks. It yielded a 0.02 increase in CI (using SPAN encoder), a reduction of up to 59 days in MAE (using RA encoder), a reduction of up to 8 days in MAE (using SPAN encoder), and an increase of up to 0.67% in accuracy (using RA encoder and SPAN encoder). The detailed results of the prediction and classification tasks for MCI to AD conversion are presented in Table 4.
In conclusion, multimodal approaches surpassed the use of unimodal approaches, both for clinical and radiomics features. Moreover, the SPAN module consistently outperformed the RA module. Integrating SPAN with AED further enhanced performance compared to utilizing SPAN with the Concat operation. In addition, further analysis of performance on combination of radiomics features and visualization of time-to-AD conversion are described in the Supplementary Materials, section Ablation studies.
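The c-index (CI) used above to compare time-to-AD predictions can be illustrated with a minimal sketch. This is Harrell's pairwise definition written in plain Python; the function name `concordance_index` and the handling of tied risks (counted as half-concordant) are assumptions made for illustration and are not taken from the paper's implementation.

```python
def concordance_index(times, events, risks):
    """Harrell's c-index: share of comparable patient pairs ranked correctly.

    times  -- observed time (conversion or censoring) per patient
    events -- 1 if the patient converted to AD (uncensored), else 0
    risks  -- predicted risk score; higher means earlier expected conversion
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair must be anchored on an observed conversion
        for j in range(n):
            if times[j] > times[i]:  # patient j was still event-free at times[i]
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # assumption: ties count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 1.0 means every comparable pair is ordered correctly, and 0.5 corresponds to random ranking.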

Discussion
Recent neuroimaging studies revealed that individuals diagnosed with MCI and AD have considerable disruption in either the structural network or the functional network when compared to a healthy control group 17,44 . Few studies have investigated the features of whole brain networks in patients with MCI at various stages of the disease. Zhang et al. 42 utilized graph theory to measure the relationship between changes in brain network connectivity from resting-state fMRI. Then, a support vector machine (SVM) was used to distinguish eMCI from lMCI at different frequency bands, achieving the best performance in the slow-5 band with an 83.87% accuracy. Transfer learning approaches, which apply a pre-trained model to new problems using a smaller dataset, are usually used to overcome the privacy and cost issues of collecting a massive quantity of annotated data. Taking advantage of this, Mehmood et al. 18 developed a layer-wise transfer learning model based on the VGG architecture family 45 to discriminate between eMCI and lMCI and achieved an 83.72% accuracy. Cui et al. 43 proposed a two-stage algorithm based on particle swarm optimization (PSO) for removing redundant features and an adaptive LASSO logistic regression model for selecting the most relevant features to predict AD stages. Experimental results showed a 76.13% accuracy on stable MCI (sMCI) vs converted MCI (cMCI) patients.
Survival analysis is a type of statistical study that examines time-to-event data, which describes the period between a time origin and an endpoint of particular interest 46 . Polsterl et al. 35 proposed a wide and deep neural network for survival analysis that learns to detect individuals who are at a high risk of advancing to AD using information from 3D hippocampal geometry and tabular clinical data. According to their findings, tabular clinical markers, with a median c-index of 0.750, are already good predictors of conversion from MCI to AD. In addition, the median c-index climbed to 0.803 when the hippocampus volume was included. Nakagawa et al. 37 found that deep learning-based survival analysis could be used to assess the likelihood that an individual will develop AD over a particular period of time. They approached the survival problem in a unique way and demonstrated encouraging results across many cohorts. Ho et al. 38 proposed a modification of the DeepSurv architecture 47 , called RASurv, to analyze the time-to-AD conversion for both cognitively normal and MCI patients. Their model achieved performance competitive with other methods, with a c-index score of 0.804.
The scarcity of studies on the prediction of time-to-AD conversion can be attributed to the difficulty of precisely determining when an individual transitioned to AD. Typically, the conversion happens before an individual is diagnosed with AD. To alleviate the problem, however, we usually assume that the event occurs at the timepoint at which the patient is diagnosed as having converted from MCI to AD. The conventional studies above focused exclusively on single tasks, despite the possibility of a correlation between MCI stages and time-to-AD conversion. In general, an eMCI patient has a lower risk of developing AD within a short period of time than an lMCI patient. As a result, it is essential to learn two tasks concurrently: MCI-stage classification and conversion-time-to-AD prediction. In addition, the criteria for eMCI and lMCI can be found in the Supplementary Materials, section Criteria for the MCI stages.
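The event-time assumption described above can be sketched as a small labelling routine. This is an illustrative sketch only: the visit format `(months_from_baseline, diagnosis)`, the label `"AD"`, and the function name `time_to_event` are hypothetical, and the censoring time is simply taken as the last observed visit rather than any adjusted value.

```python
def time_to_event(visits):
    """Derive a (time, event) survival label from one patient's visits.

    visits -- iterable of (months_from_baseline, diagnosis) tuples
    Returns (time, 1) at the first AD diagnosis, or (last_visit_time, 0)
    if the patient never converted during follow-up (censored).
    """
    ordered = sorted(visits)  # chronological order by month
    for month, diagnosis in ordered:
        if diagnosis == "AD":
            return month, 1  # uncensored: first observation of AD
    return ordered[-1][0], 0  # censored: last observation of the patient
```

For example, a patient seen at months 0, 12, and 24 who is first diagnosed with AD at month 24 is labelled `(24, 1)`, while a patient who remains eMCI through month 6 is labelled `(6, 0)`.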
This study presented a novel framework of multimodal multitask learning to discriminate eMCI patients from lMCI patients and forecast the conversion time until the onset of Alzheimer's disease. The proposed model derived features from clinical representations (which include patient information, cognitive measurements, and biomarkers) as well as radiomics representations (which are estimated from brain MRI). The SPM program was used to normalize the brain MRI dimensions and segment three different brain regions: the GM, WM, and CSF. These regions' masks, combined with the brain image, were used to determine the shape and texture radiomics characteristics with the PyRadiomics program. We proposed SPAN (stacked polynomial attention network) to effectively and reliably capture the approximation basis for all polynomials of constrained degree. The clinical and radiomics characteristics were supplied to two branches of SPAN and dropout layers, which were then used to encode the appropriate information in the patient's medical record. After that, we constructed an adaptive exponential decay (AED) factor to combine the encoded representations from both branches. We evaluated the proposed model on the ADNI cohort, where it exceeded state-of-the-art performance.
However, because the ranking optimization in this study requires a large batch size, it is essential to obtain radiomics characteristics from MRI scans to lower the dimension of the 3D images. Moreover, the performance of radiomics characteristics is substantially lower than that of clinical data, as shown in Table 1. This means that radiomics features may contribute less to the multimodal model and may potentially introduce bias into the overall network. As a result, in future studies, we will examine other strategies for extracting more robust representations of 3D images. Furthermore, we analyzed global brain regions (the GM, WM, and CSF) in this study, despite the fact that there are critical areas (such as the frontal lobe, motor cortex, sensory cortex, parietal lobe, occipital lobe, and temporal lobe) that often influence AD conversion. Therefore, future studies will include examining the relationships between brain areas.
Overall, we firmly believe that our study holds important value in the field of AD diagnosis and understanding. Our proposed multimodal multitask learning approach, attention-based mechanism, and exploratory investigation of radiomics characteristics provide valuable insights and potential avenues for early diagnosis and improved understanding of the course of AD. By integrating tasks, improving data representations, and incorporating multimodal information, we aim to advance the field's understanding of AD progression and contribute to the development of more effective diagnostic tools.

Methods
We developed a new paradigm for identifying MCI stages and predicting time-to-AD conversion using clinical and radiomics characteristics. First, we preprocess the raw clinical data and estimate radiomics features from MRI scans. Then, we encode the clinical and radiomics representations using a succession of SPAN and dropout layers. The AED algorithm efficiently fuses the two branches' prominent characteristics to predict the probability of eMCI vs lMCI phases and the hazard rate of AD conversion. The overall process is shown in Fig. 2. Note that this article does not contain any studies involving animals or human participants performed by any of the authors.
Preprocessing and feature extraction. Numerous studies on Alzheimer's disease dementia make use of information obtained through expensive and invasive techniques such as brain imaging or spinal taps to predict the risk of developing Alzheimer's disease dementia or fast cognitive decline in the future. A low-cost and noninvasive approach to studying the evolution of Alzheimer's disease dementia might be based on clinical data (e.g., demographics, vital signs, medicines, laboratory data, and current medical problems). Because clinical data can be supplied in a variety of forms, it is necessary to perform data preprocessing and transformation prior to training a model on clinical data. In this study, we perform one-hot encoding to transform categorical data, z-score normalization for unbounded numerical values, and maximum normalization for bounded numerical values, excluding volumetric biomarkers, which are scaled by dividing by the total intracranial volume (ICV) of each individual. Since clinical data in medical studies commonly contain missing values, which arise when a variable of interest is not measured or recorded for all of the participants in the sample, we utilize the Multiple Imputation by Chained Equations (MICE) algorithm 48 to impute the missing values. Radiomics, which is based on the high-dimensional quantification of medical scans and enables the retrieval of more precise features than standard visual interpretation, can reveal information for treatment interventions 49,50 . There has been some investigation into the use of radiomics in identifying the progression of AD 51,52 . These investigations revealed that radiomics biomarkers can be used to classify individuals with MCI who are at high risk of developing Alzheimer's disease in the future. Furthermore, radiomics biomarkers in combination with clinical analysis can substantially enhance the prediction accuracy of MCI to AD conversion.
To extract the radiomics features, we first utilize the "Normalization" module of the SPM toolbox to normalize the intensity and space of the three-dimensional MRI image, since the brain structure varies from person to person. Next, we segment the normalized brain into three regions, namely GM, WM, and CSF, using the "Segmentation" module. Then, we extract the various types of radiomics features, which can be divided into shape and texture groups, using the PyRadiomics tool. In our approach, we first normalize and standardize all features to ensure that features from different scales or with varying distributions are placed on a comparable scale, which prevents any single feature from dominating the learning process and promotes fair contributions from all features. Next, we concatenate all features of each type of representation, namely radiomics and clinical, to create a single feature vector for each representation. This concatenation step ensures that all relevant information from the respective feature sets is preserved. The details of clinical data preprocessing and radiomics extraction can be found in the Supplementary Materials, section Clinical and radiomics features preprocessing.
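The clinical-data transformations described in this section (one-hot encoding, z-score normalization, maximum normalization, and ICV scaling) can be sketched with NumPy as follows. The helper names are hypothetical, dividing by the observed maximum is an assumption about what "maximum normalization" means here, and missing-value imputation with MICE is deliberately left out of the sketch.

```python
import numpy as np

def one_hot(labels):
    """Encode categorical labels as one-hot rows (categories sorted)."""
    categories = sorted(set(labels))
    index = {c: k for k, c in enumerate(categories)}
    encoded = np.zeros((len(labels), len(categories)))
    encoded[np.arange(len(labels)), [index[c] for c in labels]] = 1.0
    return encoded, categories

def z_score(values):
    """Standardize an unbounded numerical feature to zero mean, unit variance."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

def max_normalize(values):
    """Scale a bounded, non-negative feature by its maximum (assumed: observed max)."""
    values = np.asarray(values, dtype=float)
    return values / values.max()

def scale_by_icv(volumes, icv):
    """Express a volumetric biomarker as a fraction of each subject's ICV."""
    return np.asarray(volumes, dtype=float) / np.asarray(icv, dtype=float)

def build_feature_vector(*blocks):
    """Concatenate per-subject feature blocks column-wise into one matrix."""
    return np.column_stack(blocks)
```

Under these assumptions, a score bounded between 0 and 30 would go through `max_normalize`, a regional volume through `scale_by_icv`, and the resulting columns would be joined with `build_feature_vector` before being passed to the encoder.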
Stacked polynomial attention network (SPAN). Recent years have seen an increase in the use of attention mechanisms to improve not only the performance but also the explainability of deep learning techniques. Initially, the attention mechanism was mostly employed to describe sequence dependencies independent of their real distances 53,54 . Abd Hamid et al. 55 used an attention mechanism and a global average pooling (GAP) layer to extract the most prominent information from an MRI image for the purpose of differentiating between AD states. A previous study 56 stated that the stacked deep polynomial network (S-DPN) can enhance the representation performance of the retrieved characteristics and holds promise for neuroimaging-based AD diagnosis. Based on these findings, we developed a novel attention mechanism based on S-DPN, dubbed the stacked polynomial attention network (SPAN), for exploiting attended representation from constrained indeterminates. Given an input feature $Z$, the first polynomial network of the SPAN module is expressed in terms of the following quantities: the indeterminates of the polynomial function of degree $n$ of the first network, the trainable parameters $W_n^{(1)}$ and $b^{(1)}$, the softmax function $\sigma(\cdot)$ for generating the attention map $M_A^{(1)}$, and the attended representation $\hat{Z}^{(1)}$. $\hat{Z}^{(1)}$ is then fed to the second polynomial network, which is similar to the first one, to stack up the feature representation and yield a better and deeper structure. The scaled exponential linear unit (SELU) activation function 57 is then used to add non-linearity to the neural network. The sequence of the second polynomial network is expressed analogously.

Multimodal fusion network. Multimodal data can help improve the accuracy of diagnosis, prediction, and overall performance of learning systems 58 . For instance, Venugopalan et al. 59 proposed multimodal deep learning models for AD data fusion to improve AD stage identification. 
Their trials established that the multimodal strategy outperformed the unimodal approach. De Jesus Junior et al. 60 described the discovery of multimodal indicators of AD severity for individuals in the early stages of the disease by combining resting-state EEG and structural MRI data. In addition, their findings demonstrated the efficacy of the multimodal strategy. In this study, we present the multimodal multitask architecture for classifying MCI stage and predicting time-to-AD conversion. Our proposed model has two branches, as shown in Fig. 3. The first branch operates on radiomics features, which are generated from a 3D MRI image using the SPM and PyRadiomics toolboxes, while the second branch operates on preprocessed clinical data. Each branch is connected to a SPAN block, which is composed of a series of SPAN layers, each followed by a dropout layer.
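The SPAN layer used inside each branch can be illustrated with a rough NumPy sketch. It is one plausible reading of the description (polynomial terms of the input combined by trainable weights, a softmax attention map, an element-wise attended output, two stacked networks, then SELU) and should not be taken as the authors' exact formulation. The function names, the use of element-wise powers of `Z` as the polynomial terms, and the parameter shapes are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def selu(x):
    """Scaled exponential linear unit with its standard constants."""
    alpha, scale = 1.6732632423543772, 1.0507009873554805
    return scale * np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def polynomial_attention(Z, weights, bias):
    """One polynomial attention network (assumed form).

    Z       -- input features, shape (batch, d)
    weights -- list of (d, d) matrices, one per polynomial degree 1..n
    bias    -- shape (d,)
    """
    # Weighted sum of element-wise powers Z, Z^2, ..., Z^n (assumption).
    poly = bias + sum((Z ** (k + 1)) @ W for k, W in enumerate(weights))
    attention = softmax(poly, axis=-1)  # attention map over the features
    return attention * Z, attention     # attended representation, map

def span(Z, params_1, params_2):
    """Stack two polynomial attention networks, then apply SELU."""
    z1, _ = polynomial_attention(Z, *params_1)
    z2, _ = polynomial_attention(z1, *params_2)
    return selu(z2)
```

Each `params_k` is a `(weights, bias)` pair; in a trained model these would be learned, and a dropout layer would follow each SPAN layer as described above.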
Assuming that $I_{radiomic}$ and $I_{clinical}$ are the output features from the SPAN blocks of the radiomics and clinical branches, respectively, we define an adaptive factor to select superior candidates from both representations. The operations for multimodal fusion involve the following quantities: $\tau$ represents the adaptive exponential decay (AED) factor, $W_\tau$, $U_\tau$, and $b_\tau$ represent trainable parameters, $pool(\cdot)$ represents the maximum pooling operation, $S(\cdot)$ represents the sigmoid function, and $I_{fused}$ represents the fused representation from clinical and radiomics features. We used the exponential negative rectifier for the decay factor $\tau$ to ensure that each decay rate decreases asymptotically within a tolerable range of 0 to 1. Lastly, the fused features are used to directly predict the probability of eMCI vs lMCI, $\hat{y}_{pr}$, and the hazard rate of AD conversion, $\hat{y}_{hr}$, with trainable parameters $W_{cl}$, $b_{cl}$, $W_{hr}$, and $b_{hr}$.
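The role of the exponential negative rectifier in the AED factor can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's fusion formula: the gate is computed from a linear combination of both branch outputs, the fused representation is a convex combination weighted by the gate, and the pooling and sigmoid operations mentioned in the text are omitted because their exact placement is not recoverable here. The name `aed_fusion` and all parameter shapes are hypothetical.

```python
import numpy as np

def aed_fusion(i_radiomic, i_clinical, W_tau, U_tau, b_tau):
    """Fuse two encoded representations with an exponential-decay gate.

    i_radiomic, i_clinical -- branch outputs, shape (batch, d)
    W_tau, U_tau           -- (d, d) weight matrices (trainable in practice)
    b_tau                  -- bias of shape (d,)
    """
    pre_activation = i_radiomic @ W_tau + i_clinical @ U_tau + b_tau
    # Exponential negative rectifier: exp(-max(0, x)) always lies in (0, 1].
    tau = np.exp(-np.maximum(0.0, pre_activation))
    # Assumed combination: tau weights one branch, (1 - tau) the other.
    fused = tau * i_radiomic + (1.0 - tau) * i_clinical
    return fused, tau
```

With all-zero gate parameters, `tau` equals 1 everywhere and the fused output reduces to the radiomics branch; as the pre-activation grows, `tau` decays toward 0 and the clinical branch dominates.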
Objective functions. To optimize the model's cost, we combine the objective functions of the two tasks: MCI-stage classification and conversion-time-to-AD prediction. For the classification task of eMCI vs lMCI, the agreement between each predicted probability and the actual class is measured by the binary cross-entropy (BCE) 61 . Predicted probabilities are penalized based on their distance from the actual value, which indicates how near or far the estimate is from the actual class. Given the actual class $y_{pr}$ (0 for eMCI and 1 for lMCI), the BCE loss $L_{BCE}$ is averaged over the $M$ samples within an iteration. Besides, we use the negative log-likelihood function 62 , $L_{NLL}$, to minimize the model's loss for the conversion-time-to-AD prediction task. In order to optimize $L_{NLL}$, we need to maximize the term $\hat{y}_{hr}^{(i)} - \log \sum_{j \in R(T_i)} e^{\hat{y}_{hr}^{(j)}}$ for each patient $i$ with event $E = 1$ (an uncensored patient who converted to AD), where the risk set $R(T_i)$ also includes censored patients (not converted to AD). It follows that we must raise the risk factor for every uncensored patient $i$ while simultaneously lowering the risk factor for patients $j$ who have not experienced the event until time $T_i$, which is the observed time-to-AD for patient $i$. Finally, we add both loss functions for simultaneous multitask learning.
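The two objectives and their sum can be sketched in NumPy using the standard forms of binary cross-entropy and the Cox negative partial log-likelihood. The function names are hypothetical, the risk set is taken as all patients with observed time at least $T_i$, and averaging the survival loss over the number of observed events is an assumption; the exact normalization used in the paper is not stated in this section.

```python
import numpy as np

def bce_loss(y_true, y_prob, eps=1e-12):
    """Binary cross-entropy averaged over the M samples of a batch."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    losses = y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob)
    return float(-np.mean(losses))

def cox_nll(times, events, hazards):
    """Negative partial log-likelihood over uncensored patients.

    times   -- observed time (conversion or censoring) per patient
    events  -- 1 for converted (uncensored), 0 for censored
    hazards -- predicted log-risk score per patient
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=float)
    hazards = np.asarray(hazards, dtype=float)
    total = 0.0
    for i in np.flatnonzero(events == 1):
        at_risk = times >= times[i]  # risk set R(T_i), includes patient i
        total += hazards[i] - np.log(np.sum(np.exp(hazards[at_risk])))
    # Assumption: normalize by the number of observed events.
    return float(-total / max(events.sum(), 1.0))

def multitask_loss(y_true, y_prob, times, events, hazards):
    """Unweighted sum of the classification and survival losses."""
    return bce_loss(y_true, y_prob) + cox_nll(times, events, hazards)
```

The unweighted sum mirrors the statement that both loss functions are added; a weighted sum would be a natural variation if one task dominated training.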

Data availability
The dataset generated and analysed during the current study are available in the Test Data section of the