Pattern Discovery in Brain Imaging Genetics via SCCA Modeling with a Generic Non-convex Penalty

Brain imaging genetics intends to uncover associations between genetic markers and neuroimaging quantitative traits. Sparse canonical correlation analysis (SCCA) can discover bi-multivariate associations and select relevant features, and is becoming popular in imaging genetic studies. The L1-norm function is not only convex, but also singular at the origin, which is a necessary condition for sparsity. Thus most SCCA methods impose \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\ell }_{{\bf{1}}}$$\end{document}ℓ1-norm onto the individual feature or the structure level of features to pursuit corresponding sparsity. However, the \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\ell }_{{\bf{1}}}$$\end{document}ℓ1-norm penalty over-penalizes large coefficients and may incurs estimation bias. A number of non-convex penalties are proposed to reduce the estimation bias in regression tasks. But using them in SCCA remains largely unexplored. In this paper, we design a unified non-convex SCCA model, based on seven non-convex functions, for unbiased estimation and stable feature selection simultaneously. We also propose an efficient optimization algorithm. The proposed method obtains both higher correlation coefficients and better canonical loading patterns. Specifically, these SCCA methods with non-convex penalties discover a strong association between the APOE e4 rs429358 SNP and the hippocampus region of the brain. They both are Alzheimer’s disease related biomarkers, indicating the potential and power of the non-convex methods in brain imaging genetics.

By identifying the associations between genetic factors and brain imaging measurements, brain imaging genetics intends to model and understand how genetic factors influence the structure or function of human brain [1][2][3][4][5][6][7][8][9][10][11][12][13][14] . Both genetic biomarkers such as single nucleotide polymorphisms (SNPs), and brain imaging measurements such as imaging quantitative traits (QTs) are multivariate. To address this problem, bi-multivariate association models, such as multiple linear regression 15 , reduced rank regression [16][17][18] , parallel independent component analysis 19 , partial least squares regression 20,21 , canonical correlation analysis (CCA) 22 and their sparsity-inducing variants 23 , have been widely used to uncover the joint effect of multiple SNPs on one or multiple QTs. Among them, SCCA (Sparse CCA), which can discover bi-multivariate relationships and extract relevant features, is becoming popular in brain imaging genetics.
The CCA technique has been introduced for several decades 24 . CCA can only perform well when the number of observations is larger than the combined feature number of the two views. Unfortunately, the problem usually is a large-p-small-n problem in the biomedical and biology studies. And it gets even worse because in CCA we are facing a large-(p + q)-small-n problem. In order to overcome this limitation, sparse CCA (SCCA) 25-36 employs a sparsity inducing regularization term to select a small set of relevant features and has received increasing attention. The 1  -norm based SCCA method 25 has gained great success for its sparsity pursuing capability. After that, there are many SCCA variants based on the 1  -norm. For examples, the fused lasso penalty imposes the 1  -norm onto the ordered pairwise features 25 , and the group lasso penalty imposes the 1  -norm onto the group of features 29,32 . Further, the graph lasso or the graph guided lasso can be viewed as imposing the  1 -norm onto the pairwise features defined by an undirected graph 29 .
However, the 1  -norm penalty shows the conflict of optimal prediction and consistent feature selection 37 . In penalized least squares modeling, Fan and Li 38 showed that a good penalty function should meet three properties.

Methods
Throughout this paper, scalars are denoted as italic letters, column vectors as boldface lowercase letters, and matrices as boldface capitals. The u denotes the Euclidean norm of a vector u. (SCCA). Let ∈ ×  X n p be a matrix representing the SNP biomarkers data, where n is the number of participants and p is the number of SNPs. Let Y n q  ∈ × be the QT data with q being the number of imaging measurements. A typical SCCA model is defined as where Xu and Yv are the canonical variables, u and v are the corresponding canonical vectors we desire to estimate, and c 1 , c 2 are the tuning parameters that control the sparsity level of the solution. The penalty function could be the  1 -norm penalty, or its variants such as the fused lasso, group lasso and graph lasso 25,27,29,32,34 .

Preliminaries. Sparse Canonical Correlation Analysis
Non-convex Penalty Functions for SCCA. In this paper, we investigate seven non-convex surrogate penalties of  0 -norm in the SCCA model. They are singular at the origin, which is essential to achieve sparsity in the solution. And they do not overly penalize large coefficients. In order to facilitate a unified description, we denote the non-convex penalty as where λ and γ are nonnegative parameters, and P λ,γ (|u i |) is a non-convex function. We absorb λ into the penalty because it cannot be decoupled from several penalties, such as the SCAD function 38 . We here have seven penalties and they are described in Table 1 and visualized in Fig. 1, where for clarity we have dropped the subscript i in u i . There is a sharp point at the origin for each of them, indicating that they are singular at the origin. This is essential to achieve sparseness in the solution. Besides, these curves are concave in |u i | and monotonically decreasing on (−∞, 0], and monotonically increasing on [0, ∞). Therefore, though these penalties are not convex, they are piecewise continuously differentiable and their supergradients exist on both (−∞, 0] and [0, ∞) 49 . Table 1 also shows their supergradients P′ λ,γ (|u i |) with respect to |u i |.
The Proposed Non-convex SCCA Model and Optimization Algorithm. Replacing the 1  -norm constraints in the SCCA model, we define the unified non-convex SCCA model as follows where Ω nc (u) and Ω nc (v) can be any of the non-convex functions listed in Table 1.
To solve the non-convex SCCA problem, we use the Lagrangian method, Scientific RepoRts | 7: 14052 | DOI:10.1038/s41598-017-13930-y from the point of view of optimization. α 1 , α 2 , λ 1 , λ 2 and γ are nonnegative tuning parameters. Next we will show how to solve this non-convex problem. The first term −u Τ X Τ Yv on the right of equation (5) is biconvex in u and v. Xu 2 is convex in u, and Yv 2 is convex in v. It remains to approximate both Ω nc (u) and Ω nc (v) and transform them into convex ones.
The local quadratic approximation (LQA) technique was introduced to quadratically expresses the SCAD penalty 38 . Based on LQA, we here show how to represent these non-convex penalties in a unified way. First, we have the first-order Taylor expansion of P ( ) where μ 0 and μ are neighbors, e.g., the estimates at two successive iterations during optimization. Substituting μ = u i 2 and μ = u ( ) i t 0 2 into (6), we have , , (as shown in Table 1) at u i t | |. Then we obtain a quadratic approximation to Ω nc (u): Penalty Name Function (P λ,γ (|u|)) Supergradient (P′λ,γ(|u|)) is not a function of u and thus will not contribute to the optimization. In a similar way, we can construct a quadratic approximation to Ω nc (v) is not a function of v and makes no contribute towards the optimization.
Denote the estimates of u and v in the t-th iteration as u t and v t , respectively. To update the estimates of u and v in the (t + 1)-th iteration, we substitute the approximate functions of Ω nc (u) and Ω nc (v) in equations (8) and (9) into (u, v) in 5, and solve the resultant approximate version of the original problem: Obviously, the equation (10) is a quadratical expression, and is biconvex in u and v. This means it is convex in terms of u given v, and vice versa. Then according to the alternate convex search (ACS) method which is designed to solve biconvex problems 48 , the (t + 1)-th estimation of u and v can be calculated via Both equations above are quadratic, and thus their closed-form solutions exist. Taking the partial derivative of (u, v) in (5) with respect to u and v and setting the results to zero, we have (i∈ [1, p]). It can be calculated by taking the partial derivative of equation (7) with respect to u i . D t 2 is also a diagonal matrix with the j-th diagonal entry as (j∈ [1, q]), and can be computed similarly. However, the i-th element According to perturbed version of LQA 50 , we address this by adding a slightly perturbed term. Then the i-th ele- where ζ is a tiny positive number. Hunter and Li 50 showed that this modification guarantees optimizing the equation (11). Then we have the updating expressions at the (t + 1)-th iteration   Table 2. Participant characteristics.
We alternate between the above two equations to graduate refine the estimates for u and v until convergence. The pseudo code of the non-convex SCCA algorithm is described in Algorithm 1.
Computational Analysis. In Algorithm 1, Step 3 and Step 6 are linear in the dimension of u and v, and are easy to compute. Step 4 and Step 7 are the critical steps of proposed algorithm. Since we have closed-form updating expressions, they can be calculated via solving a system of linear equations with quadratic complexity which avoids computing the matrix inverse with cubic complexity.
Step 5 and 8 are the re-scale step and very easy to calculate. Therefore, the whole algorithm is efficient. Data Availability. The synthetic data sets generated in this work are available from the corresponding authors' web sites, http://www.escience.cn/people/dulei/code.html and http://www.iu.edu/ shenlab/tools/ncscca/. The real data set is publicly available in the Alzheimer's Disease Neuroimaging Initiative (ADNI) database repository, http://adni.loni.usc.edu.

Experiments and Datasets
Data Description. Synthetic Dataset. There are four data sets with sparse true signals for both u and v, i.e., only a small subset of features are nonzero. The number of features of both u and v are larger than the observations to simulate a large-(p + q)-small-n task. The generating process is as follows. We first generate u and v with most feature being zero. After that, the latent variable z is constructed from Gaussian distribution N(0, I n × n ). Then we create the data X from N z The first three sets have 250 features for u and 600 ones for v, but they have different correlation coefficients. There are 500 features and 900 features in u and v respectively for the last data set. We show the true signal of every data set in Fig. 2 (top row).
Real Neuroimaging Genetics Dataset. Data used in the preparation of this article were obtained from the ADNI database (adni.loni.usc.edu). The ADNI was launched in 2003 as a public-private partnership by the National Institute on Aging (NIA), the National Institute of Biomedical Imaging and Bioengineering (NIBIB), the Food and Drug Administration (FDA) etc, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer's disease (AD). For up-to-date information, see www.adni-info.org. The study protocols were approved by the Institutional Review Boards of all participating centers (Northwestern Polytechnical University, Indiana University and ADNI (A complete list of ADNI sites is available at http://www.adni-info.org/)) and written informed consent was obtained from all participants or authorized representatives. All the analyses were performed on the de-identified ADNI data, and were determined by Indiana University Human Subjects Office as IU IRB Review Not Required.
The real neuroimaging genetics dataset were collected from 743 participants, and the details was presented in Table 2. There were 163 candidate SNP biomarkers from the AD-risk genes, e.g., APOE, in the genotyping data. The structural MRI scans were processed with voxel-based morphometry (VBM) in SPM8 51,52 . Briefly, scans were aligned to a T1-weighted template image, segmented into gray matter (GM), white matter (WM) and  Table 3. The searching range of optimal γ for each non-convex penalty.
cerebrospinal fluid (CSF) maps, normalized to MNI space, and smoothed with an 8mm FWHM kernel. We subsampled the whole brain and generated 465 voxels spanning the whole brain ROIs. The regression technique was employed to remove the effects of the baseline age, gender, education, and handedness for these VBM measures. The aim of this study is to evaluate the correlation between the SNPs and the VBM measures, and further identify which SNPs and ROIs are associated.
Experimental Setup. Benchmarks. In this paper, we are mainly interested in whether these non-convex SCCA methods could enhance the performance of  1 -SCCA method based on our motivation. It is reasonable to employ the 1  -norm based methods in comparison. Therefore, the structure-aware SCCA methods such as 28,29,32,34 are not contained here as benchmark. Based on different mathematical techniques, there are three different  1 -SCCA algorithms. They are the singular value decomposition based method 25 , the primal-dual based method 29 and the LQA based method 32 . Though the latter two are proposed for capturing group or network structure, they can be easily reformulated to the  1 -norm constrained methods, such as setting the parameters associated with the structure penalty to zero 29 . Therefore, to make the comparison fair and convincing, we choose all of them as benchmarks. With a slight abuse of notation, we use the penalty name to refer a non-convex SCCA method, e.g. ETP for ETP based SCCA method. For the  1 -norm based methods, we call them L1-SCCA 25 , L1-S2CCA 32 , and L1-NSCCA 29 .
Parameter Tuning. There are four parameters λ i (i = 1, 2) and α i (i = 1, 2) associated with the non-convex SCCA methods, and one pivotal parameter γ. According to their equations, these non-convex penalties can approximate the  0 -norm by providing an appropriate γ. In this situation, the λ i and α i play a very weak role because theoretically the 0  -norm penalized problem does not rely on the parameters. Based on this consideration, we here only tune the γ other than tuning λ i and α i by a grid search strategy. This reduces the time consumption dramatically but does not affect the performance significantly. Further, we observe that two γ's perform similarly if they are not significantly different. Thus the tuning range of γ is not continuous. Besides, we set γ = 3.7 for SCAD penalty since 38 suggested that this is a very reasonable choice. The details of tuning range for each penalty are contained in Table 3. For λ i and α i , we simply set them to 1 in this study.
as the termination condition for Algorithm 1, where ε is the user defined error bound. In this study, we set ε = 10 −5 according to experiments. All  Table 5. Training and testing correlation coefficients (mean ± std) of 5-fold cross-validation synthetic data sets.
The best values are shown in boldface.
Results on Synthetic Data. Figure 2 shows the heat maps of canonical loadings estimated from all SCCA methods, where each row corresponds to an experimental method. We clearly observe that the non-convex SCCA methods and L1-SCCA correctly identify the identical signal positions to the ground truth across four data sets. Besides true signals, L1-SCCA introduces several undesired signals which makes it be inferior to our methods. As  . Mapping averaged canonical weight v's estimated by every SCCA method onto the brain. The left panel and right panel show five methods respectively, where each row corresponds to a SCCA method. The L1-SCCA identifies the most signals, followed by the L1-NSCCA and L1-S2CCA. All the proposed methods identify a clean signal that helps further investigation.
Scientific RepoRts | 7: 14052 | DOI:10.1038/s41598-017-13930-y a contrast, L1-NSCCA finds out an incomplete proportion of the ground truth, and L1-S2CCA performs unstably as it fails on some folds. Moreover, we also prioritize these methods using the AUC (area under ROC) criterion in Table 4, where a higher value indicates a better performance. The results exhibit that the non-convex SCCA methods have the highest score at almost every case. L1-SCCA scores similarly to the proposed methods, but later we can see it pays the price at a reduced prediction ability. Table 5 presents the estimated correlation coefficients on both training and testing data, where the best values are shown in boldface. The proposed SCCA methods alternatively gain the best value, and the Log method wins out for the most times. This demonstrates that the non-convex methods outperform  1 -norm based SCCA methods in terms of the prediction power. In summary, the proposed methods identify accurate and sparse canonical loading patterns and obtain high correlation coefficients simultaneously, while those 1  -norm based SCCA methods cannot.

Results on Real Neuroimaging Genetics Data.
In this real data study, the genotyping data is denoted by X, and the imaging data is denoted by Y. The u is a vector of weights of all SNPs, and v is a vector of weights of all imaging markers.The canonical correlation coefficients are defined as Pearson correlation coefficient between Xu and Yv, i.e., Τ Xu Yv Xu Yv ( ) /( ). Figure 3 presents the heat maps regrading the canonical loadings generated from the training set. In this figure, each row shows two weights of a SCCA method, where a larger weight stands for a more importance. The weight associated with the SNPs is on the left panel, and that associated with the voxels is on the right. The proposed non-convex SCCA methods obtain very clean and sparse weights for both u and v. The largest signal on the genetic side is the APOE e4 SNP rs429358, which has been previously reported to be related to AD 53 . On the right panel, the largest signal for all SCCA methods comes from the hippocampus region. This is one of the most notable biomarkers as an indicator of AD, since atrophy of hippocampus has been shown to be related to brain atrophy and neuron loss measured with MRI in AD cohort 53 . In addition, the L1-S2CCA and SCAD methods identify a weak signal from the parahippocampal gyrus, which is previously reported as an early biomarker of AD 54 . On some folds, the Log method also finds out the lingual region, parahippocampal gyrus, vermis region. Interestingly, all the three regions have shown to be correlated to AD, and could be further considered as an indicating biomarker that can be observed prior to a dementia diagnosis. For example, Sjöbeck and Englund reported that molecular layer gliosis and atrophy in the vermis are clearly severer in AD patients than in the health controls 55 . This is meaningful since the non-convex SCCA methods identify the correct clue for further investigation. On this account, both L1-SCCA and L1-NSCCA are not good choices since they identify too many signals, which may misguide subsequent investigation. The figure shows that L1-S2CCA could be an alternative choice for sparse imaging genetics analysis, but it performs unstably across the five folds. And, the non-convex methods is more consistent and stable than those  1 -SCCA methods. To show the results more clearly, we map the canonical weights (averaged across 5 folds) regarding the imaging measurements from each SCCA method onto the brain in Fig. 4. The figure confirms that the L1-SCCA and L1-NSCCA find out many signals that are not sparse. The L1-S2CCA identifies fewer signals than both L1-SCCA and L1-NSCCA, but more than all these non-convex SCCA methods. All the non-convex SCCA only highlights a small region of the whole brain. This again reveals that the proposed methods have better canonical weights which reduces the effort of further investigation.
Besides, we include both training and testing correlation coefficients in Table 6, where their mean and standard deviation are shown. The training results of all methods are similar, with the Log method gains the highest value of 0.33 ± 0.03. As for the testing results, which is our primary interest, all the non-convex SCCA methods obtain better values than these 1  -SCCA methods. Besides, the difference between the training and testing performance of the proposed methods is much smaller than that of three  1 -SCCA methods. This means that the non-convex methods have better generalization performance as they are less likely to fall into overfitting issue. The result of this real imaging genetics data reveals that the proposed SCCA methods can extract more accurate and sparser canonical weights for both genetic and imaging biomarkers, and obtain higher correlation coefficients than those 1  -SCCA methods.

Conclusion
We have proposed a unified non-convex SCCA model and an efficient optimization algorithm using a family of non-convex penalty functions. These penalties are concave and piecewise continuous, and thus piecewise differentiable. We approximate these non-convex penalties by an 2  function via the local quadratic approximation (LQA) 38 . Therefore, the proposed algorithm is effective and runs fast.
We compare the non-convex methods with three state-of-the-art  1 -SCCA methods using both simulation data and real imaging genetics data. The simulation data have different ground truth structures. The results on the simulation data show that the non-convex SCCA methods identify cleaner and better canonical loadings than the three 1  -SCCA methods, i.e. L1-SCCA 25 , L1-S2CCA 32 , and L1-NSCCA 29 . These non-convex methods also recover  Table 6. Performance comparison on real data set. Training and testing correlation coefficients (mean ± std) of 5-fold cross-validation are shown. The best value is shown in boldface.
Scientific RepoRts | 7: 14052 | DOI:10.1038/s41598-017-13930-y higher correlation coefficients than 1  -SCCA methods, demonstrating that 1  -SCCA methods have suboptimal prediction capability as they may over penalize large coefficients. The results on the real data show that the proposed methods discover a pair of meaningful genetic and brain imaging biomarkers, while the 1  -SCCA methods return too many irrelevant signals. The correlation coefficients show that the non-convex SCCA methods hold better testing values. This verifies our motivation that the non-convex penalty can improve the prediction ability, and thus has better generalization capability. Obviously, the parameter γ plays a key role in these non-convex penalties. In the future work, we will investigate how to choose a reasonable γ; and explore how to incorporate structure information into the model as structure information extraction is an important task for brain imaging genetics as well as biology studies.