Power Calculation of Multi-step Combined Principal Components with Applications to Genetic Association Studies

Principal component analysis (PCA) is a useful tool to identify important linear combination of correlated variables in multivariate analysis and has been applied to detect association between genetic variants and human complex diseases of interest. How to choose adequate number of principal components (PCs) to represent the original system in an optimal way is a key issue for PCA. Note that the traditional PCA, only using a few top PCs while discarding the other PCs, might significantly lose power in genetic association studies if all the PCs contain non-ignorable signals. In order to make full use of information from all PCs, Aschard and his colleagues have proposed a multi-step combined PCs method (named mCPC) recently, which performs well especially when several traits are highly correlated. However, the power superiority of mCPC has just been illustrated by simulation, while the theoretical power performance of mCPC has not been studied yet. In this work, we attempt to investigate theoretical properties of mCPC and further propose a novel and efficient strategy to combine PCs. Extensive simulation results confirm that the proposed method is more robust than existing procedures. A real data application to detect the association between gene TRAF1-C5 and rheumatoid arthritis further shows good performance of the proposed procedure.

Identification of genetic variants associated with human complex diseases can help investigators further understand genetic structure of diseases of interest. Compared with single-marker analysis, which tests every marker individually and is commonly employed in genome-wide association study, multiple-marker test has been well appreciated because of its potentially improved statistical power. Statistical methods for multiple-marker analysis can be summarized as synthesizing single-marker test statistics such as Hotelling's T 2 test 1-3 and summation of squared univariate test 4,5 , weighted Fourier transformation 6 , variance-components score test 7 , principal components regression method [8][9][10] , and Kernel-machine-based test 11 . Performances of these methods have been explored by intensive computer simulations 1,12,13 . Their results showed that when the number of SNPs is relatively large, variance-component-based methods and principal components regression methods were found to have competitive power.
As it is well known that principal component analysis (PCA) is a useful tool to search for important characteristics among correlated variables. A key issue in developing an effective PCA model is choosing an adequate number of principal components (PCs) to represent the system in an optimal way. Taking advantage of the size of variances, Hocking 14 provided a firm rule for retaining PCs in the framework of regression models. Usually, in PCA, investigators only used a few top principal components and discarded the other PCs. Recently, some investigators have illustrated that commonly used method for choosing PCs is not always reasonable. In fact, as early as in 1982, Jollife 15 showed an interesting counter-intuitive phenomenon that principal components explaining a small amount of variances can be as important as those explaining a large amount of variances when analyzing non-genetic data. Aschard et al. 16 confirmed this phenomenon when analyzing genetic data and proposed a called multi-step combined principal component (mCPC) strategy. However, the performance of mCPC strongly depends on how to partition all PCs.
Without loss of generality, suppose a random vector T follows a multivariate normal distribution with a m × 1 mean vector μ and known m × m covariance matrix V. We want to test the null hypothesis H 0 : μ = 0. Therefore, a Chi-squared statistic can be used for testing H 0 . However, when m is large, which is fairly common in genome-wide association studies, Chi-squared test might substantially lose power due to its large degrees of freedom. To reduce degrees of freedom, PCA is recommended. Based on orthogonal decomposition, we have is large, although the variance of PC 2 is less than that of PC 1 . So we can not discard any PCs arbitrarily. In order to test H 0 : μ = 0, Aschard et al. 17 proposed a multi-step combined principal component (mCPC) as following , where F k (·) is cumulative distribution function of a central Chi-squared random variable with k degrees of freedom. Moreover, they used simulation to compare the power of various PCA-based strategies when analyzing up to 100 correlated traits, and showed that their method with combining the signals across all PCs could have greater power. However, there has not been an in-depth study of the theoretical properties of mCPC in Aschard et al.'s paper 16 . Obviously, Aschard et al. 16 find an unusual way to fully utilize all PCs. Another key issue is to decide the value of k. A commonly used method for selecting k is based on cumulative contribution rates, which are equal to λ λ , and denoted by c k for =  k m 1, , , respectively. Let , for any c ∈ [0, 1]. Aschard et al. 16 followed the traditional way to use mCPC (k) with k being determined by cumulative contribution rate of 80%.
In this work, we focus on the theoretical power of mCPC and find that the maximum power of mCPC is related to the maximum noncentral parameters under alternative hypothesis. We also find that the noncentral parameter corresponding to the top PC (the first PC which corresponds to the largest eigenvalue) is greater than 0 under most scenarios and those of other PCs do not possess this property when only a few means of all PCs are non-zero under alternative and the correlation coefficients among original variables are relatively large. Herein we propose a method tCPC. Based on numerical results, the tCPC is more powerful than the existing procedures under most of the considered scenarios.

Results
Theoretical Properties of mCPC (k). For the multiple genetic variants association studies, the above random vector can be written as =   T  T T  T  ( , , , ) , where T i is the statistic that is used to test for the association between the phenotype of interested and the ith genetic variants, =  i m 1, 2, , . V is the covariance matrix of the random vector T. Through the eigen-decomposition of the covariance matrix, we have , and q i is the eigenvector corresponding to the eigenvalue λ i , =  i m 1, 2, , . Then we can obtain transformed statistics as Furthermore, under H 0 , Z i follows a central Chi-squared distribution with 1 degree of freedom for =  i m 1, 2, , . Under the alternative hypothesis, , a noncentral Chi-squared distribution with 1 degree of freedom and non-centrality be the inverse function of F i (·). Note that for any given x ∈ [0, 1], follow Chi-square distributions with 2 degrees of freedom, then mCPC (k) follows a central Chi-squared distribution with 4 degrees of freedom.
According to Sankaran 17 , probability density function of a noncentral Chi-squared distribution with d degrees of freedom and non-centrality parameter ξ is , and For any x > 0, and ∈  k m {1, 2, , }, the probability density function of , and the probability density function of ∑ x m k m k m k m k Let C 1−α be 1 − α quantile of a central Chi-squared distribution with 4 degrees of freedom. The power of mCPC (k) under the significance level α is are m − 1 real numbers. However, for a m-dimensional space, . Hence, the non-centrality parameter of the Chi-squared distribution of the statistic Z 1 is not equal to 0 almost everywhere. Besides this, and since the top PC possesses the largest variation among all PCs and k = 1 is a boundary point of the set consisting of  m 1, 2, , , herein we propose to use the following strategy (named tCPC) to combine all PCs Under null hypothesis of no association at any locus, tCPC follows a central Chi-squared distribution with 4 degrees of freedom.
Simulation Settings and Numerical Results. In this subsection, we conduct simulation studies to compare powers between tCPC to some exiting approaches such as Hotelling's T 2 test (HT) [1][2][3] 4 , sequence kernel association test (SKAT) 11 and multi-step combine principal component test mCPC (k 0.8 ) 17 .
Consider testing association between m genetic variants (or SNPs) and a complex human disease.
be a test statistic, where τ means the transpose of a vector or matrix. For example, we can construct T using the method in Chatterjee et al. 18 to detect genetic association between m SNPs and a binary trait as , in which V G is the pooled-sample covariance matrix of all SNPs.
In order to obtain T and V m×m , we first generate a latent vector with length of 20 from a multivariate normal distribution with covariance structures of a compound symmetry with equal pairwise correlation ρ. Then, this latent vector is dichotomized to yield a haplotype with predesignated minor allele frequency (MAF). We repeat the above process 100,000 times to form a large population. Without loss of generality, we designate the first SNP as disease-causal SNP with MAF being p, and other SNPs as noncausal SNPs with MAFs all being q. Both sizes of case samples and control samples are set to be 1,000. Case or control status of one subject is generated from a logistic regression model  Table 1. Table 1 shows that all these tests can control type I error rate correctly. Next we consider a decreasing correlation structure. As a preliminary step, a latent vector with length of 20 is generated from a multivariate normal distribution with covariance matrix being (ρ |i−j| ) 20×20 . Other simulation settings are similar as above and are shown in Table 2. As presented in Table 2 Tables 1 and 2 comprehensively, we can see that, when linkage disequilibrium extents among all SNPs are relatively strong, tCPC performs more robustly than existing statistical methods.
Applications to gene TRAF1-C5 associated with Rheumatoid Arthritis. We apply tCPC and the other five existing tests to detect the association between gene TRAF1-C5 and rheumatoid arthritis using the data from the Genetic Analysis Workshop 16 19 . Our goal is to detect whether there is an association between gene TRAF1-C5 and rheumatoid arthritis. This gene has been reported to be deleterious previously 20   as Burton et al. 21 , only the proposed tCPC can detect the moderate-strong association signal between the gene TRAF1-C5 and rheumatoid arthritis.

Discussion
Principal component analysis is a common tool to grasp important features of correlated variables and has been applied in genetic association studies. In principal component analysis, cumulative contribution rate of 80% or    90% is commonly adopted to choose PCs. However, this adoption is not always suitable since PCs with low contribution rate might be much more strongly correlated with the outcome than those with large contribution rate. To overcome this drawback, a mCPC method was developed recently 16 . In this study, we explored theoretical powers of mCPC deeply and find out that the maximum power of mCPC depends on the maximum noncentral

Covariance matrix Scenarios Mean vectors
Uniform correlation with ρ = 0.8 Decreasing correlation with ρ = 0.8 Decreasing correlation with ρ = 0.2 (S13) Table 3. Parameter settings about means and covariance matrices. parameters of Chi-squared distributions for all PCs under the alternative hypothesis. However, it is difficult to obtain this information beforehand in practice. In view of this, we propose a novel and robust strategy to combine PCs. We also propose a test for genome-wide association studies and compare powers of this test to mCPC (k 0.8 ) and some other existing procedures such as Hotelling's T 2 test (HT), oPC (k 0.8 ) SSU and SKAT by extensive simulations. All simulation results show that our proposed procedure is more robust than mCPC, HT, oPC (k 0.8 ), SSU and SKAT. Results of real data analysis further demonstrates good performances of our proposed test. We suggest researchers to employ our robust strategy when they consider using principal component analysis method in the future. i S1-S4 S1 S2 S3 S4 S5-S8 S5 S6 S7 S8  Table 4. Eigenvalues, cumulative contribution rates and non-centrality parameters for Scenarios (S1) to (S8).
i S9-S12 S9 S10 S11 S12 S13-S16 S13 S14 S15 S16  Table 5. Eigenvalues, cumulative contribution rates and non-centrality parameters for Scenarios (S9) to (S16). It should be noted that our proposed procedure is built upon test by Chatterjee et al. 18 which was designed to detect association between a marker and a binary trait. It can be easily extended to other application fields. For instance, it has been used in pleiotropic genetic study to identify deleterious genetic variants associated with multiple traits 16 . In addition, our proposed test can also be used to detect the association between genetic variants and quantitative traits in framework of linear model, and ordinal traits on basis of proportional odds model. If quantitative traits do not follow normal distribution, one can consider constructing a multivariate nonparametric trend test 22 and then employ our proposed strategy to combine them.

Methods
Maximum Powers of mCPC and ordinary PCA over extensive scenarios. For fixed m, the powers of mCPC (k) mainly depend on k, μ and V m×m . We set different mean vectors under the alternative hypothesis among different covariance matrices V m×m . We also consider two types of V m×m : one is that m-dimension variables are uniformly correlated, which means covariance matrix V m×m is a symmetry positively definite matrix with diagonal elements all being 1 and non-diagonal elements all being ρ; the other is that all correlations among these m variables are decreasing considering the "physical" distance (SNP location), which means V m×m = (ρ |i−j| ) m×m . Without loss of generality, ρ is chosen to be 0.8 for strong linkage disequilibrium and 0.2 for weak linkage disequilibrium. Here, we consider m = 20. Note that, a test based on ordinary PCA can be gained, which is denoted by oPC (k) with , where =  k m 1, 2, , . Obviously, powers of oPC (k) are also affected by k when Ω Ω  , , m 1 are given. In order to view powers of mCPC (k) and oPC (k) comprehensively, we set α to be 0.05, and calculate powers of mCPC (k) and oPC (k) by numerical integration in R software under scenarios S1 to S16. All parameter settings about scenario S1 to S16 are displayed in Table 3. We calculate eigenvalues of V m×m , c i , Ω i of all scenarios S1 to S16 for =  i m 1, , , and display all results in Tables 4 and 5. All power results of mCPC and ordinary PCA are displayed in Figs 1-4. Under the same correlation structure and mean vector, the powers of mCPC (k) and oPC (k) are affected strongly by selection of k. From Tables 3-5 and Figs 1-4, we can find that both the maximum powers of mCPC (k) and oPC (k) are related to the maximum non-centrality parameters of all PCs as k is from 1 to 20. For example, in Scenario S15, the non-centrality parameter of the second PC is the maximum among those of all PCs, and mCPC (2) has the maximum power. In another example, under the scenario S2, the non-centrality parameter of the third PC is the maximum, mCPC (4) has the maximum power, and powers of mCPC (3) and mCPC (4) are close. We can also find out that ordinary way to select k to construct mCPC (k) is not always desirable. For example, in Scenario S12, k 0.8 = 5, but oPC (5) has power as low as 0.115. It is verified that mCPC is more desirable than ordinary PCA based on Figs 1-4. It is also verified that the selection of k according to the cumulative contribution rate is not robust. One can just follow the common adoption and choose c = 80% or 90% for oPC (k c ), but it will result in loss of power substantially under some situations. Furthermore, we draw a conclusion that mCPC (k c ) performs more robust than oPC (k c ) similar, since mCPC (k c ) has reasonable powers over all the considered scenarios. For example, in Scenario S2, these 20 variables are uniformly correlated with ρ = 0.8 and μ μ = = = .  0 8 1 1 0 , μ μ = = =  0 11 20 , the power of oPC (k 0.8 ) is 0.073, which is far less than the power of mCPC (k c ), which is 0.57.
A novel robust strategy to combine PCs. A further investigation of the maximum powers of mCPC (k) and oPC (k) shows that both of them are related to non-centrality parameters of the Chi-square distributions under the alternative hypothesis. For example, about scenario S16 in Table 5, the non-centrality parameters of all 20 PCs are 0.01, 0.05, 0.11, 0.19, 0.27, 0.37, 0.45, 0.53, 0.58, 0.61, 0.62, 0.60, 0.55, 0.48, 0.38, 0.30, 0.21, 0.12, 0.06 and 0.01 respectively, and non-centrality parameter being 0.62 which belongs to the 11th PC is the largest one among the non-centrality parameters of all 20 PCs. mCPC (10) takes the maximum power with 0.248 and the power of mCPC (11) is 0.247, which is very close to that of mCPC (10). The difference maybe are caused by numerical computing errors. The cumulative contribution rate of the top 10 PCs are 62.25%, which is much less than 80%. It is worth noted that the non-centrality parameters are determined by means and covariance matrix, which are hard to know in practice. Therefore, if we can know some prior information on means and covariance matrix, then the optimal strategy for selection of k become more prone to obtain. Aschard et al. 17 proposed to use mCPC (k) with k being determined by cumulative contribution rate of 80%.