Prediction of age at onset in Parkinson’s disease using objective specific neuroimaging genetics based on a sparse canonical correlation analysis

The age at onset (AAO) is an important determinant in Parkinson’s disease (PD). Neuroimaging genetics is suitable for studying AAO in PD as it jointly analyzes imaging and genetics. We aimed to identify features associated with AAO in PD by applying the objective-specific neuroimaging genetics approach and constructing an AAO prediction model. Our objective-specific neuroimaging genetics extended the sparse canonical correlation analysis by an additional data type related to the target task to investigate possible associations of the imaging–genetic, genetic–target, and imaging–target pairs simultaneously. The identified imaging, genetic, and combined features were used to construct analytical models to predict the AAO in a nested five-fold cross-validation. We compared our approach with those from two feature selection approaches where only associations of imaging–target and genetic–target were explored. Using only imaging features, AAO prediction was accurate in all methods. Using only genetic features, the results from other methods were worse or unstable compared to our model. Using both imaging and genetic features, our proposed model predicted the AAO well (r = 0.5486). Our findings could have significant impacts on the characterization of prodromal PD and contribute to diagnosing PD early because genetic features could be measured accurately from birth.

Scientific RepoRtS | (2020) 10:11662 | https://doi.org/10.1038/s41598-020-68301-x www.nature.com/scientificreports/ to model and understand how genetic factors influence the structure or function of the human brain. Canonical correlation analysis (CCA) is a popular method of imaging genetics to combine high-dimensional data types. The CCA finds the best linear transformations for imaging and genetic features so that the correlation between imaging and genetic can be maximized. In practice, the sparse CCA (SCCA) method is adopted to find a sparse (i.e., only a few components) set of features through regularization, such as with least absolute shrinkage and selection operator (LASSO). The existing SCCA was focused only on finding the best linear transformations for imaging and genetic features, respectively. Therefore, the resulting set of features has a high imaging-genetic association, but it does not necessarily have a strong association with the disease being studied. This makes interpreting the selected features challenging as we are not sure of the feature-target disease association. To improve the interpretation, we extended the sparse CCA to include a third data type related to the target task (i.e., AAO of PD), which leads to simultaneously exploring the associations of the imaging-genetic, genetic-target, and imaging-target pairs. Our rationale was to explore all three possible pair-wise associations to leverage the available data fully. We named the approach objective specific neuroimaging genetics emphasizing the third added data type related to the main objective.
In this study, we applied the objective-specific neuroimaging genetics approach to identify genetic and imaging features associated with AAO in PD patients, not in healthy controls. We also aimed to construct an analytical model to predict the AAO of PD using the identified genetic and imaging features simultaneously. Neuroimaging and single nucleotide polymorphisms (SNPs) data from the Parkinson's Progression Markers Initiative (PPMI) were obtained. Neuroimaging features that can reflect the differences in the brain structure and function with respect to AAO of PD were computed from the fractional anisotropy (FA) of diffusion tensor imaging (DTI) that is a sensitive method to assess PD pathophysiology and severity 13 . Then, we applied our proposed method with the FA values, PD-related SNP alleles, and AAO as three input data types. The results were then compared with existing methods.
We hope to better understand how genetic variations of specific risk genes lead to alterations in the brain structure and function with neuroimaging genetics. Such an understanding could have tremendous potential for accurately diagnosing and improving therapy for PD. The imaging and genetic feature used to predict AAO of PD could help with characterizing the prodromal period of PD. This study focused on AAO in PD patients only still, our study could be extended to predict diagnosis by replacing the AAO term with diagnosis status and ultimately contribute to the early diagnosis of PD. In particular, our study could make a significant contribution to preventive medicine in PD, combining the identified SNPs with the known environmental factors of PD. The neurodegenerative process has already substantially progressed when the diagnosis of PD is established based on the widely accepted clinical criteria 14 . However, our identified SNPs might serve as early biomarkers for AAO in PD as they can be measured accurately from birth.

Methods
Subjects. This study was a retrospective analysis, and institutional review board (IRB) approval was obtained from Sungkyunkwan University. Our study was performed in full accordance with the local IRB guidelines. Informed consent was obtained from all subjects. Our study data of 146 PD patients were obtained from the PPMI database 15 . We used DTI, T1-weighted MRI, and DNA genotyping data. Inclusion criteria were aged 30 years or older, diagnosis of PD (based on one of the following: presence of 1 asymmetrical resting tremor or 2 asymmetrical bradykinesia or 3 at least 2 of resting tremor, bradykinesia, and rigidity), disease duration of 1-24 months, Hoehn and Yahr (H&Y) stage of 1-2, and presence of the T1-MRI, DTI, and genotyping data. AAO was based on patients' recollection of the first parkinsonian motor symptom 3 . Additionally, we divided the patients into four subgroups based on AAO (younger than 50 years, 50-59 years, 60-69 years, and 70 years or older), according to Pagano et al 3 . The demographic and clinical data of enrolled subjects were illustrated in Table 1. www.nature.com/scientificreports/ Neuroimaging and genotype data. MRI data were obtained using a standard protocol for 3 T scan-DTI processing. Before DTI processing, T1-weighted images need to be processed first. The T1-weighted images were skull-stripped and nonlinearly registered onto the standard Montreal Neurological Institute (MNI) spatial frame using the FMRIB Analysis Group Software Library (FSL, https ://fsl.fmrib .ox.ac.uk/fsl/fdt) 19 . The DTI data were processed using the diffusion tensor analysis toolkit (FDT) of FSL. Head motion and image distortions induced by eddy currents were corrected by applying a 3D full-affine alignment of each image to the mean no-diffusion-weighting (b0) image. DTI data were averaged and concatenated after the correction of distortion. We generated voxel-wise maps of FA, which quantifies how elongated the diffusion tensor ellipsoid is. Additionally, it is known to be sensitive to a wide range of pathologies 20 . FA maps were registered onto the common MNI space using the same procedure to process T1 MRI. We adopted the automated anatomical labeling (AAL) atlas defined on the MNI space to specify the region of interests (ROIs). We finally calculated the FA values of 90 ROIs using the sample mean of voxel-wise FA values from each ROI.
Objective-specific imaging genetic associations. Below we use boldface lowercase letters to denote a vector and boldface uppercase letters to denote a matrix. Given datasets X ∈ R n×p , Y ∈ R n×q , and Z ∈ R n×1 , with n samples, where X denotes p features of neuroimaging (i.e., FA values) and Y denotes q features of genetics (i.e., the count of the non-reference allele (0, 1, 2) for each SNP), and Z denotes features related to target objective (i.e., AAO of patients with PD). Our proposed objective-specific SCCA (os-SCCA) is a special case of conventional three-way SCCA (TSCCA), where the third data type is from the target task. This leads to the www.nature.com/scientificreports/ optimization of three pairwise SCCAs simultaneously 21,22 . By replacing one data type of TSCCA with objectiverelevant information, we can find imaging genetic associations that are relevant to a particular task (i.e., AAO of PD). The formulation is defined as follows: where u and v were the corresponding canonical loading vectors, and w was scalar. Our approach is a simple version of CCA where only the first principal directions were considered to maximize the canonical correlation value. The canonical loading vector u for X (n × p matrix of neuroimaging data) contains information regarding what regions to choose over p regions. In the same vein, you can think of w as canonical loading vector for Z (i.e., AAO values). Since Z is an n × 1 vector w become a scalar (1 × 1 vector). We can rewrite the objective functions for os-SCCA as follows 21 : where u and v were regularization parameters. The l 1 penalty, controlled by u and v , was applied to induce sparsity and prevent overfitting 21 .
Imaging-genetic feature selection from os-SCCA and construction of prediction models. Due to a limited number of samples, we adopted a nested five-fold cross-validation. For the outer loop, we separated the data into training and test sets using a fivefold split. For the inner loop, the training data were subject to another fivefold cross-validation where the inner training set was used to train the model and the remaining set was used as the validation set to tune the hyperparameters of the model. The value of w was set using the formula given in a previous study adjusted for the dimensions of Z 21 . Optimal value of regularization parameters (i.e., u , v ) was determined within each training set via internal fivefold cross-validation (CV).
where X i ,.Y i , and Z i denoted the i-th subset of the validation set and u -I and v -i , denoted the estimated loading vectors from the datasets except for the i-th subset (training set, X -i , Y -i , and Z -i ) in the inner loop. We calculated the average metric score over the five folds in (3) and chose the average as the hyperparameters. Coming back to the outer loop, we used data from four folds to train the regression model and applied the learned model to the left-out test fold. We repeated this process five times with a different fold left-out each time. The software code used in this study is available at code-sharing website (https ://githu b.com/Ji-Hye-Won/os-tscca ). We applied FA values, the number of minor alleles (i.e., 0, 1 or 2) of PD-related SNP, and AAO of PD as three inputs for training samples to os-SCCA. The elements with non-zero coefficients of the loading vectors were the selected imaging and genetics features. Then, we constructed random forest regression models based on the selected imaging and genetic features to predict the AAO of PD. We trained a random forest of 500 regression trees using the selected features. Three regression models with only the selected imaging features, genetic features, and combined imaging and genetic features were built. This resulted in five sets of performance metrics of the models, which were averaged. We assessed the performance of the prediction using Pearson's correlation between the actual and predicted AAO of PD in the test fold. The root mean square error (RMSE) was also used 23 .
Over-representation enrichment analysis. The selected SNPs were annotated with the genes, and they were processed with an over-representation enrichment analysis (ORA) using an online database WebGestalt (https ://www.webge stalt .org/2019/). The ORA compares sets of genes annotated to pathways to a list of the identified genes that are significantly deferentially expressed. A hypergenometric test was adopted to detect an over-representation of the high-ranking genes among all the genes in a given category. The whole genome was selected as the reference gene list.
Imaging-genetic feature selection from other methods. The selected features and prediction models from the os-SCCA were compared with existing sparse multi-view feature selection methods. We compared our approach with LASSO and the minimum redundancy maximum relevance (mRMR), where only associations of genetic-target and imaging-target were explored. Our method extracts significant SNP and imaging features at the same time. Contrastingly, LASSO and mRMR select significant features of a target variable (i.e., AAO of PD) directly from larger sets of candidate predictors corresponding to features from imaging and genetic separately. In the case of mRMR, the number of selected features is a user parameter. This number was set as the same number of features obtained from our os-SCCA. We also constructed random forest regression models using imaging, genetic, and combined features from LASSO and mRMR based the training set data. We validated the models by applying them to test sets. In summary, the same procedures were applied using different selected features from the LASSO and mRMR approaches.

Results
Selected imaging and genetic features from the os-SCCA approach. We used FA values and genetic data of 146 PD patients from the PPMI database 15 . The FA values of 90 regions were computed from the DTI preprocessing. The 72 allelic statuses were selected from Parkinson's disease-associated variants for PPMI subjects. Our analysis is different from GWAS as we performed our experiments with only SNPs related to PD. Thus, the identified SNPs were not subject to statistical rigors of GWAS and should be interpreted as important variables of the prediction model. We applied the os-SCCA approach using these two types of data and target information (i.e., AAO of PD) in a five-fold cross-validation. The identified features were similar in the five sets of results. We only reported results that were common to all five sets. We identified FA of 14 ROIs related to both SNPs and AAO of PD in all five training sets. The 14 selected regions were as follows: left and right precentral gyrus (PreCG); left and right median cingulate and paracingulate gyri (DCG); left and right posterior cingulate gyrus (PCG); left and right superior occipital gyrus (SOG); left and right lenticular nucleus, putamen (PUT); left and right lenticular nucleus, pallidum (PAL); and left and right thalamus (THA).
Over-representation enrichment analysis results. Table 2 provides a summary of the identified 24 SNPs, including their corresponding genes. After annotating these SNPs, genes were analyzed using an ORA. The list contained 25 gene names in which 23 genes are mapped to 23 unique entrezgene IDs. Among 23 unique entrezgene IDs, 20 genes were annotated to the selected functional categories and the reference list, which were used for the enrichment analysis. A significant gene ontology analysis included negative regulation of the cellular amine metabolic process (false discovery rate [FDR] = 0.0324) and regulation of the cellular amine metabolic process (FDR = 0.0199). The top 10 categories were identified as enriched categories. These are shown in Fig. 2.
Selected imaging and genetic features from mRMR. The same number of features were used between our approaches and mRMR to enable a fair comparison. Each feature was ranked using the feature selection (FS) score provided from the mRMR, and we analyzed only a few features (i.e., the same number from os-SCCA) with the highest FS scores. The FS scores were typically high, above 0.2, for the first few imaging fea- Comparison of prediction models using the three approaches. We constructed random forest regression models using the selected imaging, genetics feature, and the combined features from os-SCCA to explain the AAO of PD in a five-fold cross-validation. Additional prediction models were built using the features obtained from mRMR and LASSO approaches and compared with those of os-SCCA. Using imaging features only, all three approaches worked well. Our proposed model showed a meaningful correlation (r = 0.5184, p = 0.0105; averaged) between the predicted and real AAO of PD over five left out test folds. The mean RMSE between the predicted and real AAO of PD was 8.1197. Using mRMR resulted in a correlation of r = 0.5260 with p = 0.0026, while using LASSO led to a correlation of r = 0.5586 with p = 0.0022. The prediction plots are displayed in the first column of Fig. 3. The correlation values derived from mRMR and LASSO were higher than www.nature.com/scientificreports/ our approach partially. A potential reason behind this trend is given in "Discussion" section. The mean RMSE values were 8.1652 using mRMR and 7.9354 using LASSO. Using genetic features only, the models were less successful than using imaging features (Fig. 3b). The LASSO failed to identify genetic features in some cases, and there were only a few features that had high enough FS scores. Our proposed model showed a modest correlation (r = 0.2678, p = 0.1364; averaged) between the predicted and real AAO of PD over five left out test folds. The mean RMSE between the predicted and real AAO of PD was 9.1678. Using mRMR resulted in a correlation of r = 0.1568 with p = 0.4457, while using LASSO led to a correlation of r = 0.0829 with p = 0.6684. In the case of LASSO, there were even negative values of r. The correlation values derived from our approach were higher than those derived from mRMR and LASSO. The second column of Fig. 3 displays these results. The mean RMSE values were 9.1678, 9.4043, and 10.1420 when using os-SCCA, mRMR, and LASSO, respectively.
Using both imaging and genetic features, our model showed prediction improvement (increase in correlation between real and predicted values) compared to using only imaging features. However, using mRMR and LASSO led to degraded performance (decrease in correlation) compared to using only imaging features even if more features were adopted. Our proposed model showed an improved correlation (r = 0.5486, p = 0,051; averaged) between the predicted and real AAO of PD over five left out test folds. The mean RMSE between the predicted and real AAO of PD was 7.9764. Using mRMR resulted in a correlation of r = 0.3029 with p = 2,186, while using LASSO led to a correlation of r = 0.5453 with p = 0.0025. The prediction plots using both imaging and genetic features are shown in the third column of Fig. 3. The mean RMSE was 11.8779 (mRMR), 8.0273 (LASSO), and 7.9764 (our approach). Overall, we found that our approach identified more meaningful features by fully exploring the three pair-wise associations present in the data.

Discussion
To the best of our knowledge, this is the first study to report on the predictions of AAO of PD in the field of imaging genetics. Recent clinical trials in PD have focused on neuroprotective treatment. Therefore, the early diagnosis and prediction of PD are becoming more important 24 . Various non-motor symptoms could present even before parkinsonian motor symptoms and be regarded as markers for PD development 25,26 . However, these non-motor symptoms are not PD-specific symptoms, and common in both PD patients and normal elderly subjects 1 . Many previous studies also tried to investigate protein biomarkers in the cerebrospinal fluid of PD patients 27,28 , but these proteins are only available from an invasive procedure with possible side effects. Further, these studies analyzed protein biomarkers in PD patients after diagnosis. Therefore, all these results were  www.nature.com/scientificreports/ the result of neurodegenerative changes 29 . By using neuroimaging and SNP data simultaneously, our prediction model might reflect not only the results of neurodegeneration but also the neurodegenerative changes associated with disease risk. This is because SNP data are available from birth. Two types of data in our study, genetic and neuroimaging information, are becoming widely available in healthcare, meaning that our model could be useful in clinical practice.
In this study, we adopted a multivariate model of os-SCCA to capture the associations of the imaging-genetic, genetic-target, and imaging-target pairs. This approach considers imaging and genetic data simultaneously to capture the multivariate nature of the data better, reduces the number of statistical tests, and regularizes high dimensional data. The resulting features were used to predict the AAO of PD, and the performance was enhanced if the features derived from os-SCCA were used. If each of the resulting brain regions, genes, and the interactions of brain region and gene from our study are studied in more detail, these resulting features could be used as biomarkers related to the prodromal period of PD. Also, our study requires fewer samples than conventional GWAS as there are not a lot of univariate tests. Future studies using more samples from independent cohorts are necessary to validate our findings fully.
The prediction models that used only imaging features showed a high correlation for all three approaches. The correlation values derived from mRMR and LASSO were higher than our approach partially due to the nature of the mRMR and LASSO. This may be because LASSO and mRMR are tailored for extracting continuous features (e.g., imaging features) directly related to AAO as the outcome. This translates to optimizing only the imaging-target associations. However, os-SCCA optimizes additional genetic-target and imaging-genetic associations simultaneously, which leads to holistic modeling of the multimodal data. The strengths of os-SCCA were well demonstrated in the predictive model with genetic features. Therefore, it becomes challenging to select a few related to the outcome. Using LASSO, there were folds with no genetic features selected. Using mRMR, although we ensured that the same number of features were used as in os-SCCA, the FS of the selected genetic features were very low, which might obscure the selection process. Only with jointly optimizing the genetic-target and imaging-genetic associations using our approach, could meaningful genetic features be selected. We were able to predict AAO with an RMSE of less than 10 years using our approach. This implies that approximate estimation at a 10-year interval is possible. Our results show that it was possible to predict the AAO of PD using only the genetic features from os-SCCA. These findings could be used as potential biomarkers related to AAO of PD, and help elucidate the associated pathomechanism of PD. Still, the identified SNPs need to be confirmed in additional validation studies with more samples. Most patients in our study have AAO between 50 and 60. Thus, we constructed a simple additional model using only the sample mean of all data (i.e., mean of AAO = 61.35). The RMSE was 9.4689, which fared worse than those using both imaging and genetic features (7.9764), only imaging features (8.1197), and only genetic features (9.1678).
The effectiveness of os-SCCA was seen by evaluating the performance of the predictive model using both genetic and neuroimaging data. By considering all thee pair-wise associations in selecting the features, the performance improved over models that used only imaging and genetic features in every fold. In models using features from mRMR, even if the model could use more features than those of os-SCCA, the performance decreased. In the case of LASSO, the performance was lower than os-SCCA. Overall, we confirmed that imaging and genetic features should be selected, considering all possible associations when using genetic and neuroimaging data at the same time in an imaging genetic framework.
We identified 25 genes annotated from 24 SNPs and 14 brain regions in all training sets. One of the resulting genes, NUCKS1, is known to have a functional association in the brain of people with PD. A significant association of expression and transcription levels of NUCKS1 with PD has been observed 30,31 . ACMSD is associated with aging and risk of PD, tentatively suggesting that this enzyme might influence pathogenesis 32 . Additionally, COMT, an enzyme involved in the degradation of dopamine, is a critical determinant of the availability of this neurotransmitter in the prefrontal cortex. COMT modulates the activity of the enzyme, affects cognition, and is magnified in the aging brain 33 . There are also studies showing that COMT is related to the late-onset of PD, and COMT is a modifier of the AAO in PD 34,35 . We investigated rationales for the identified brain regions. The PCG and PreCG are the regions that have age-related differences in the structure of the brain 36,37 . Other brain structures such as THA found in our study were also reported in early PD patients 38,39 . The ratio of dopamine transporter binding of PUT is reported to be related to the difference between old-onset PD and young-onset PD groups 40 . In particular, PCG is one of the regions that showed substantially different gene profile changes with age 41 . As symptoms of PD vary according to AAO, an in-depth study of the brain regions of our study combined with the PD symptoms might contribute to the enhanced characterization of PD subtypes and possibly lead to a redefinition of the subtypes 42 . Considering that neurodegenerative changes start a few decades before AAO, diverse changes in various brain regions are quite feasible. However, most previous studies have focused on the brain structure related to clinical features in PD patients, and very few of them assessed the association between AAO and brain structure in PD patients. Clinical characteristics of young-onset PD are already well-known from many previous studies [2][3][4] , but it was difficult to determine whether the relevant brain structures in our study were from AAO itself or were also linked with clinical features in young-onset PD patients. It is challenging to control for all the clinical diversity present in PD patients, and our results were consistent with previous studies which revealed the known involved structure in early PD patients. Considering there is no previous study focused on the AAO of PD, our results should be interpreted cautiously. Our primary objective was related to AAO of PD, and we did not distinguish between healthy controls (HC) and PD. This was partly because there was an ambiguity in HC from the PPMI database. The PPMI data are not follow-up data with a long duration, thus HC cases could develop PD in the future or stay healthy. This can be examined further in future studies using well-designed follow-up data. A longitudinal study is better suited for AAO of PD as it might offer more accurate AAO information of PD along with neuroimaging data near the time when neurodegeneration starts. www.nature.com/scientificreports/ Our study has some limitations. Our results were obtained using a cross-sectional design. Therefore, it is difficult to distinguish between PD progression and risk. Although imaging genetic analysis needs a smaller sample size compared to GWAS, we recruited a relatively small number of patients. Even with these limitations in study design and sample size, we were able to suggest significant prediction models for AAO of PD using genetic and neuroimaging features from os-SCCA. We hope this study is the first step to identifying the pathomechanism and related genetic factors in PD patients. Second, PD is a wide-spectrum disorder with diverse motor and non-motor symptoms 1 . Considering that AAO is related to various parkinsonian symptoms and progression, it is difficult to eliminate all the confounding factors. Therefore, it is difficult to control for all possible clinical characteristics and to focus only on those relevant to AAO. In a similar vein, we identified brain imaging-SNP pairs related to AAO in PD using only SNPs related to PD to make feature selection and interpretation tractable. To use the os-SCCA for other purposes, a similar reduction in feature dimension of the input modalities might be necessary especially for the genetic factors. Lastly, no replication using another dataset is a limitation of our study. Our approach requires multimodal data requirements and we could not find any open database satisfying our data requirements to the best of our knowledge.
In this study, we investigated relevant genetic factors and brain regions associated with the AAO of PD and suggested a prediction model using os-SCCA. Our study jointly models genetic factors and neuroimaging and thus could be a more suitable model compared to studies using genetic factors and neuroimaging only. Our proposed os-SCCA is a special case of TSCCA. Thus, the scalability of TSCCA applies. Our approach could be extended to included additional imaging modalities hence becoming four-way SCCA. TSCCA belongs to the family of linear models and thus could be easily scaled in terms of computational speed and size of data sets. We hope our study can be the first step toward accurate early diagnosis and improved therapy for PD.

code availability
The os-SCCA code is available through the repository at https ://githu b.com/Ji-Hye-Won/os-tscca .