Introduction

Identification of genetic variants associated with human complex diseases can help investigators better understand the genetic architecture of the diseases of interest. Compared with single-marker analysis, which tests every marker individually and is commonly employed in genome-wide association studies, multiple-marker tests have attracted considerable attention because of their potentially improved statistical power. Statistical methods for multiple-marker analysis typically synthesize single-marker test statistics; examples include Hotelling’s T2 test1,2,3, the summation of squared univariate tests4,5, weighted Fourier transformation6, the variance-components score test7, principal components regression8,9,10, and kernel-machine-based tests11. The performance of these methods has been explored through intensive computer simulations1,12,13. The results showed that when the number of SNPs is relatively large, variance-component-based methods and principal components regression have competitive power.

Principal component analysis (PCA) is a well-known and useful tool for extracting important characteristics from correlated variables. A key issue in developing an effective PCA model is choosing an adequate number of principal components (PCs) to represent the system in an optimal way. Taking advantage of the sizes of the variances, Hocking14 provided a firm rule for retaining PCs in the framework of regression models. Usually, investigators use only a few top principal components and discard the rest. Recently, however, some investigators have shown that this common practice is not always reasonable. In fact, as early as 1982, Jolliffe15 demonstrated a counter-intuitive phenomenon: principal components explaining a small amount of variance can be as important as those explaining a large amount when analyzing non-genetic data. Aschard et al.16 confirmed this phenomenon with genetic data and proposed a so-called multi-step combined principal component (mCPC) strategy. However, the performance of mCPC depends strongly on how the PCs are partitioned.

Without loss of generality, suppose a random vector T follows a multivariate normal distribution with an m × 1 mean vector μ and known m × m covariance matrix V. We want to test the null hypothesis H0: μ = 0. A Chi-squared statistic T^τ V^−1 T can be used for testing H0. However, when m is large, which is fairly common in genome-wide association studies, the Chi-squared test might lose substantial power due to its large degrees of freedom. To reduce the degrees of freedom, PCA is recommended. Based on orthogonal decomposition, we have V = QΛQ^τ, where Λ = diag (λ1, …, λm) with λ1 ≥ λ2 ≥ ⋯ ≥ λm > 0, Q = (q1, …, qm), and qi is the eigenvector corresponding to the eigenvalue λi, i = 1, …, m. Define Zi = (qi^τ T)²/λi for i = 1, …, m. We note that Zi is related to the ith PC. Under H0, Zi ~ χ²(1), a central Chi-squared distribution with 1 degree of freedom, for i = 1, …, m. Under the alternative hypothesis, Zi ~ χ²(1; Ωi), a noncentral Chi-squared distribution with 1 degree of freedom and non-centrality parameter Ωi = (qi^τ μ)²/λi, for i = 1, …, m.
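The PC transformation just described can be sketched in a few lines of Python (a minimal illustration with our own function name; `T` and `V` are assumed to be given as NumPy arrays):

```python
import numpy as np

def pc_statistics(T, V):
    """Decompose V = Q diag(lambda) Q^T and return the per-PC
    statistics Z_i = (q_i^T T)^2 / lambda_i (each chi2(1) under H0)."""
    lam, Q = np.linalg.eigh(V)        # eigh returns ascending eigenvalues
    order = np.argsort(lam)[::-1]     # reorder so lambda_1 >= ... >= lambda_m
    lam, Q = lam[order], Q[:, order]
    Z = (Q.T @ T) ** 2 / lam
    return Z, lam, Q

# toy example: m = 3, compound-symmetry covariance
V = 0.5 * np.ones((3, 3)) + 0.5 * np.eye(3)
T = np.array([1.0, 0.5, -0.2])
Z, lam, Q = pc_statistics(T, V)
```

A useful sanity check on the construction: summing all m statistics recovers the full Chi-squared statistic, Z1 + ⋯ + Zm = T^τ V^−1 T.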

For simplicity, with m = 2 we consider a linear model with a normally distributed phenotype Y depending on two scaled genotypes G1 and G2, each normally distributed with mean 0 and variance 1 and with correlation ρ > 0. The phenotype can be expressed as Y = β0 + G1β1 + G2β2 + ε, where ε is a random error term following the standard normal distribution. For this model, the principal components of the two genotypes are PC1 = (G1 + G2)/√2 and PC2 = (G1 − G2)/√2. After some algebra, we get G1 = (PC1 + PC2)/√2 and G2 = (PC1 − PC2)/√2, so the phenotype can be re-expressed as Y = β0 + ((β1 + β2)/√2) PC1 + ((β1 − β2)/√2) PC2 + ε. This indicates that PC2 may be very important when |β1 − β2| is large, although the variance of PC2 (1 − ρ) is less than that of PC1 (1 + ρ). So we cannot discard any PCs arbitrarily. In order to test H0: μ = 0, Aschard et al.16 proposed the multi-step combined principal component (mCPC) statistic

mCPC (k) = −2 ln [1 − Fk (Z1 + ⋯ + Zk)] − 2 ln [1 − Fm−k (Zk+1 + ⋯ + Zm)],

where Fk (·) is the cumulative distribution function of a central Chi-squared random variable with k degrees of freedom. They used simulation to compare the power of various PCA-based strategies when analyzing up to 100 correlated traits, and showed that their method, which combines the signals across all PCs, can have greater power. However, the theoretical properties of mCPC were not studied in depth in Aschard et al.’s paper16, although they clearly found an effective way to fully utilize all PCs. Another key issue is how to decide the value of k. A commonly used method for selecting k is based on the cumulative contribution rates, ck = (λ1 + ⋯ + λk)/(λ1 + ⋯ + λm) for k = 1, …, m. Let kc = min {k: ck ≥ c} for any c ∈ [0, 1]. Aschard et al.16 followed the traditional way of using mCPC (kc) with k determined by a cumulative contribution rate of 80%.
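As we read it, mCPC (k) Fisher-combines two pooled tail probabilities: one from the first k PCs and one from the remaining m − k. A minimal sketch (function names are ours; `Z` is the vector of per-PC statistics defined above):

```python
import numpy as np
from scipy.stats import chi2

def mcpc(Z, k):
    """mCPC(k) = -2 ln(1 - F_k(Z_1 + ... + Z_k))
                 - 2 ln(1 - F_{m-k}(Z_{k+1} + ... + Z_m)).
    Under H0 the statistic follows a central chi-squared with 4 df."""
    Z = np.asarray(Z)
    m = len(Z)
    p1 = chi2.sf(Z[:k].sum(), df=k)       # upper tail = 1 - F_k(.)
    p2 = chi2.sf(Z[k:].sum(), df=m - k)
    stat = -2.0 * (np.log(p1) + np.log(p2))
    return stat, chi2.sf(stat, df=4)

# null calibration check: i.i.d. chi2(1) draws for the Z_i
rng = np.random.default_rng(1)
pvals = np.array([mcpc(rng.chisquare(1, size=10), k=3)[1]
                  for _ in range(2000)])
```

Since the two blocks of PCs are disjoint and the Zi are independent, the two combined p-values are independent, which is what makes the χ²(4) null distribution exact.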

In this work, we focus on the theoretical power of mCPC and find that the maximum power of mCPC is related to the maximum non-centrality parameter under the alternative hypothesis. We also find that the non-centrality parameter corresponding to the top PC (the first PC, which corresponds to the largest eigenvalue) is greater than 0 under most scenarios, whereas those of the other PCs do not possess this property when only a few of the PC means are non-zero under the alternative and the correlation coefficients among the original variables are relatively large. Herein we propose a method called tCPC. Based on numerical results, tCPC is more powerful than the existing procedures under most of the considered scenarios.

Results

Theoretical Properties of mCPC (k)

For association studies of multiple genetic variants, the above random vector can be written as T = (T1, …, Tm)^τ, where Ti is the statistic used to test for association between the phenotype of interest and the ith genetic variant, i = 1, …, m, and V is the covariance matrix of the random vector T. Through the eigen-decomposition of the covariance matrix, we have V = QΛQ^τ, where Λ = diag (λ1, …, λm) with λ1 ≥ λ2 ≥ ⋯ ≥ λm > 0, Q = (q1, …, qm), and qi is the eigenvector corresponding to the eigenvalue λi, i = 1, …, m. Then we obtain the transformed statistics Zi = (qi^τ T)²/λi, i = 1, …, m. Under H0, Zi follows a central Chi-squared distribution with 1 degree of freedom for i = 1, …, m. Under the alternative hypothesis, Zi ~ χ²(1; Ωi), a noncentral Chi-squared distribution with 1 degree of freedom and non-centrality parameter Ωi = (qi^τ μ)²/λi, i = 1, …, m.

For i = 1, …, m, let Fi^−1 (·) be the inverse function of Fi (·). Note that Fi (Fi^−1 (x)) = x for any given x ∈ [0, 1]. Under H0, both Fk (Z1 + ⋯ + Zk) and Fm−k (Zk+1 + ⋯ + Zm) follow the uniform distribution on [0, 1] and are independent of each other. So both −2 ln [1 − Fk (Z1 + ⋯ + Zk)] and −2 ln [1 − Fm−k (Zk+1 + ⋯ + Zm)] follow Chi-squared distributions with 2 degrees of freedom, and hence mCPC (k) follows a central Chi-squared distribution with 4 degrees of freedom.

According to Sankaran17, the probability density function of a noncentral Chi-squared distribution with d degrees of freedom and non-centrality parameter ξ is

f (x; d, ξ) = Σj≥0 [e^−ξ/2 (ξ/2)^j / j!] fd+2j (x),

where fd+2j (·) is the density of a central Chi-squared distribution with d + 2j degrees of freedom. Denote Ω(1) = Ω1 + ⋯ + Ωk and Ω(2) = Ωk+1 + ⋯ + Ωm, so that Z1 + ⋯ + Zk ~ χ²(k; Ω(1)) and Zk+1 + ⋯ + Zm ~ χ²(m − k; Ω(2)) under the alternative. For any x > 0, the probability density function of U1 = −2 ln [1 − Fk (Z1 + ⋯ + Zk)] is

g1 (x) = e^−x/2 f (t1 (x); k, Ω(1)) / [2 fk (t1 (x))],

where t1 (x) = Fk^−1 (1 − e^−x/2), and the probability density function of U2 = −2 ln [1 − Fm−k (Zk+1 + ⋯ + Zm)] is

g2 (x) = e^−x/2 f (t2 (x); m − k, Ω(2)) / [2 fm−k (t2 (x))]

with t2 (x) = Fm−k^−1 (1 − e^−x/2).

Let C1−α be the 1 − α quantile of a central Chi-squared distribution with 4 degrees of freedom. The power of mCPC (k) under the significance level α is

P (mCPC (k) > C1−α) = ∫∫{x+y > C1−α} g1 (x) g2 (y) dx dy.
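The power integral can be cross-checked without the explicit densities by Monte Carlo (a sketch under assumed values of μ and V, not the paper's settings; names are ours):

```python
import numpy as np
from scipy.stats import chi2

def mcpc_power_mc(mu, V, k, alpha=0.05, reps=5000, seed=0):
    """Monte Carlo power of mCPC(k): draw T ~ N(mu, V), form the
    per-PC statistics, Fisher-combine, compare with the chi2(4) cutoff."""
    rng = np.random.default_rng(seed)
    lam, Q = np.linalg.eigh(V)
    order = np.argsort(lam)[::-1]
    lam, Q = lam[order], Q[:, order]
    L = np.linalg.cholesky(V)
    m = len(mu)
    cut = chi2.ppf(1 - alpha, df=4)
    hits = 0
    for _ in range(reps):
        T = mu + L @ rng.standard_normal(m)
        Z = (Q.T @ T) ** 2 / lam
        stat = -2.0 * (np.log(chi2.sf(Z[:k].sum(), k))
                       + np.log(chi2.sf(Z[k:].sum(), m - k)))
        hits += stat > cut
    return hits / reps

# sanity check: under H0 (zero mean) the rejection rate should be ~ alpha
size0 = mcpc_power_mc(np.zeros(5),
                      0.3 * np.ones((5, 5)) + 0.7 * np.eye(5), k=2)
```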

Based on the above notation, the non-centrality parameter of the distribution of Z1, which corresponds to the first PC, is Ω1 = (q1^τ μ)²/λ1, where μ is the mean vector of T under the alternative hypothesis. Ω1 = 0 if and only if the mean vector μ belongs to the space spanned by the other m − 1 eigenvectors, that is, μ = a2 q2 + ⋯ + am qm, where a2, …, am are m − 1 real numbers. However, this subspace has Lebesgue measure zero in the m-dimensional space. Hence, the non-centrality parameter of the Chi-squared distribution of Z1 is non-zero almost everywhere. Since, in addition, the top PC possesses the largest variation among all PCs and k = 1 is a boundary point of the set {1, …, m − 1}, we propose the following strategy (named tCPC) to combine all PCs:

tCPC = −2 ln [1 − F1 (Z1)] − 2 ln [1 − Fm−1 (Z2 + ⋯ + Zm)].

Under the null hypothesis of no association at any locus, tCPC follows a central Chi-squared distribution with 4 degrees of freedom.
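Read this way, tCPC fixes the mCPC split at k = 1: the top PC's tail probability is Fisher-combined with the pooled tail probability of the remaining m − 1 PCs. A self-contained sketch (our naming, same conventions as above):

```python
import numpy as np
from scipy.stats import chi2

def tcpc(Z):
    """tCPC: Fisher-combine the top PC's tail probability with the
    pooled tail probability of the remaining m - 1 PCs.
    Under H0 the statistic follows a central chi-squared with 4 df."""
    Z = np.asarray(Z)
    m = len(Z)
    p_top = chi2.sf(Z[0], df=1)
    p_rest = chi2.sf(Z[1:].sum(), df=m - 1)
    stat = -2.0 * (np.log(p_top) + np.log(p_rest))
    return stat, chi2.sf(stat, df=4)

# null calibration check: i.i.d. chi2(1) draws for the Z_i
rng = np.random.default_rng(4)
null_rej = np.mean([tcpc(rng.chisquare(1, size=10))[1] < 0.05
                    for _ in range(2000)])
```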

Simulation Settings and Numerical Results

In this subsection, we conduct simulation studies to compare the power of tCPC with some existing approaches: Hotelling’s T2 test (HT)1,2,3, the ordinary PCA test (oPC), the summation of squared univariate test statistics (SSU)4, the sequence kernel association test (SKAT)11, and the multi-step combined principal component test mCPC (k0.8)16.

Consider testing association between m genetic variants (or SNPs) and a complex human disease. Let T = (T1, …, Tm)^τ be a test-statistic vector, where τ denotes the transpose of a vector or matrix. For example, we can construct T using the method in Chatterjee et al.18 to detect genetic association between m SNPs and a binary trait as

Tj = (1/n1) Σ{i ∈ cases} gi,j − (1/n2) Σ{i ∈ controls} gi,j, j = 1, …, m,

where gi,j denotes the genotype of the jth SNP for the ith individual, and n1 and n2 are the sample sizes of the case group and control group, respectively. Under the null hypothesis that these m SNPs are not associated with the disease of interest, T asymptotically follows a multivariate normal distribution N (μm×1, Vm×m) with mean vector μm×1 and covariance matrix Vm×m = (1/n1 + 1/n2) VG, in which VG is the pooled-sample covariance matrix of all SNPs.
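The construction can be sketched as follows (a hedged reconstruction: we take Tj to be the difference in mean genotype between cases and controls, which is consistent with the stated covariance Vm×m = (1/n1 + 1/n2) VG; the exact scaling in Chatterjee et al.18 may differ):

```python
import numpy as np

def case_control_statistic(G_case, G_ctrl):
    """T_j = mean genotype at SNP j in cases minus controls;
    V = (1/n1 + 1/n2) * pooled within-group genotype covariance."""
    n1, n2 = G_case.shape[0], G_ctrl.shape[0]
    T = G_case.mean(axis=0) - G_ctrl.mean(axis=0)
    resid = np.vstack([G_case - G_case.mean(axis=0),
                       G_ctrl - G_ctrl.mean(axis=0)])
    VG = resid.T @ resid / (n1 + n2 - 2)        # pooled covariance
    return T, (1.0 / n1 + 1.0 / n2) * VG

# tiny worked example: one SNP, two cases (genotypes 0, 2), two controls (0, 0)
T, V = case_control_statistic(np.array([[0.0], [2.0]]),
                              np.array([[0.0], [0.0]]))
```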

In order to obtain T and Vm×m, we first generate a latent vector of length 20 from a multivariate normal distribution with a compound-symmetry covariance structure with equal pairwise correlation ρ. This latent vector is then dichotomized to yield a haplotype with predesignated minor allele frequency (MAF). We repeat the above process 100,000 times to form a large population. Without loss of generality, we designate the first SNP as the disease-causal SNP with MAF p, and the other SNPs as noncausal SNPs with MAFs all equal to q. The sizes of the case and control samples are both set to 1,000. The case or control status of a subject is generated from the logistic regression model

logit {P (Yi = 1)} = β0 + β1 gi,1 + ⋯ + β20 gi,20,

with β0 = −1.5, β1 ∈ {ln (1.4), ln (1.6)} denoting the log odds ratio for the disease-causal SNP, and β2 = ⋯ = β20 = 0 denoting the log odds ratios for the non-causal SNPs, where Yi = 1 or 0 represents the disease or healthy status of the ith individual. The nominal significance level is 0.05 throughout the simulations, and the number of replicates is 1,000. All parameter settings and the corresponding results are displayed in Table 1. Table 1 shows that all these tests control the type I error rate correctly. For example, when the pairwise correlation of the 20 SNPs is ρ = 0.50, β1 = 0, and p = q = 0.20, the empirical type I error rates of HT, oPC (k0.8), SSU, SKAT, mCPC (k0.8), and tCPC are 0.051, 0.052, 0.043, 0.049, 0.051, and 0.045, respectively. The power comparison shows that tCPC performs more robustly than the other methods. For example, when the pairwise correlations of the 20 SNPs are uniformly equal to 0.20 and β1 = ln 1.4, p = q = 0.20, the empirical powers of HT, oPC (k0.8), SSU, SKAT, mCPC (k0.8), and tCPC are 0.754, 0.736, 0.772, 0.766, 0.700, and 0.766, respectively. The empirical power of tCPC is slightly lower than that of SSU in this scenario. However, when ρ = 0.50, β1 = ln 1.4, and p = q = 0.20, the empirical powers of HT, oPC (k0.8), SSU, SKAT, mCPC (k0.8), and tCPC are 0.724, 0.756, 0.810, 0.816, 0.702, and 0.856, respectively; tCPC clearly performs best among all the considered procedures in this setting.
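The genotype-generation step above can be sketched like this (a minimal version of the described scheme; names are ours, and for brevity we dichotomize the latent normals at the normal quantile of the MAF):

```python
import numpy as np
from scipy.stats import norm

def simulate_haplotypes(n, mafs, rho, rng):
    """n haplotypes over m SNPs: latent compound-symmetry MVN,
    dichotomized so SNP j carries the minor allele with prob mafs[j]."""
    m = len(mafs)
    cov = rho * np.ones((m, m)) + (1.0 - rho) * np.eye(m)
    latent = rng.multivariate_normal(np.zeros(m), cov, size=n)
    return (latent < norm.ppf(mafs)).astype(np.int8)

rng = np.random.default_rng(2)
mafs = np.full(20, 0.20)
hap = simulate_haplotypes(5000, mafs, rho=0.5, rng=rng)
# a genotype is the sum of two independently drawn haplotypes
geno = hap[rng.integers(0, 5000, 1000)] + hap[rng.integers(0, 5000, 1000)]
```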

Table 1 Empirical type I error rates and powers of HT, oPC (k0.8), SSU, SKAT, mCPC (k0.8) and tCPC for constant correlations.

Next we consider a decreasing correlation structure. As a preliminary step, a latent vector of length 20 is generated from a multivariate normal distribution with covariance matrix (ρ^|i−j|)20×20. The other simulation settings are similar to those above and are shown in Table 2. As presented in Table 2, the empirical type I error rates of all the tests are close to the nominal significance level, indicating that they control the type I error rate correctly. For instance, when ρ = 0.50, β1 = 0, and p = q = 0.20, the empirical type I error rates of HT, oPC (k0.8), SSU, SKAT, mCPC (k0.8), and tCPC are 0.036, 0.032, 0.040, 0.044, 0.038, and 0.042, respectively. For the power comparison, tCPC again performs more robustly than the other methods. For example, when the correlations of the 20 SNPs decrease with distance and ρ = 0.80, β1 = ln 1.4, p = 0.20, and q = 0.20, the empirical powers of HT, oPC (k0.8), SSU, SKAT, mCPC (k0.8), and tCPC are 0.770, 0.846, 0.772, 0.762, 0.788, and 0.836, respectively. The empirical power of tCPC is slightly lower than that of oPC (k0.8) in this case. However, when ρ = 0.95, β1 = ln 1.4, p = 0.20, and q = 0.20, the powers of HT, oPC (k0.8), SSU, SKAT, mCPC (k0.8), and tCPC are 0.746, 0.898, 0.858, 0.858, 0.860, and 0.864, respectively, so tCPC remains among the most powerful of the six methods in this scenario. Comparing Tables 1 and 2 comprehensively, we can see that when the extent of linkage disequilibrium among the SNPs is relatively strong, tCPC performs more robustly than the existing statistical methods.

Table 2 Empirical type I error rates and powers of HT, oPC (k0.8), SSU, SKAT, mCPC (k0.8) and tCPC for decreasing correlations.

Applications to gene TRAF1-C5 associated with Rheumatoid Arthritis

We apply tCPC and the other five existing tests to detect the association between the gene TRAF1-C5 and rheumatoid arthritis using data from Genetic Analysis Workshop 1619. This gene has previously been reported to be deleterious20. There are 2,062 subjects in this study, including 868 cases and 1,194 controls, and the gene TRAF1-C5 consists of 38 SNPs. The p-values of HT, oPC (k0.8), SSU, SKAT, mCPC (k0.8), and tCPC for detecting association between TRAF1-C5 and rheumatoid arthritis are 5.21 × 10−5, 7.58 × 10−3, 5.95 × 10−4, 6.50 × 10−5, 7.56 × 10−5, and 3.75 × 10−5, respectively. If, following Burton et al.21, we use the p-value threshold of 5 × 10−5 for moderate association at the genome-wide level, only the proposed tCPC detects a moderate-to-strong association signal between the gene TRAF1-C5 and rheumatoid arthritis.

Discussion

Principal component analysis is a common tool for grasping important features of correlated variables and has been applied in genetic association studies. In principal component analysis, a cumulative contribution rate of 80% or 90% is commonly adopted to choose PCs. However, this practice is not always suitable, since PCs with low contribution rates may be much more strongly correlated with the outcome than those with large contribution rates. To overcome this drawback, the mCPC method was developed recently16. In this study, we explored the theoretical power of mCPC in depth and found that the maximum power of mCPC depends on the maximum non-centrality parameter of the Chi-squared distributions of the PCs under the alternative hypothesis. However, this information is difficult to obtain beforehand in practice. In view of this, we propose a novel and robust strategy to combine PCs. We compare the power of the resulting test with mCPC (k0.8) and some other existing procedures, namely Hotelling’s T2 test (HT), oPC (k0.8), SSU, and SKAT, through extensive simulations. All simulation results show that our proposed procedure is more robust than mCPC, HT, oPC (k0.8), SSU, and SKAT. The results of the real data analysis further demonstrate the good performance of our proposed test. We suggest that researchers employ this robust strategy when considering principal component analysis methods in the future.

It should be noted that our proposed procedure is built upon the test by Chatterjee et al.18, which was designed to detect association between markers and a binary trait. It can be easily extended to other application fields. For instance, similar strategies have been used in pleiotropic genetic studies to identify deleterious genetic variants associated with multiple traits16. In addition, our proposed test can also be used to detect association between genetic variants and quantitative traits in the framework of a linear model, and with ordinal traits on the basis of a proportional odds model. If quantitative traits do not follow a normal distribution, one can construct a multivariate nonparametric trend test22 and then employ our proposed strategy to combine the components.

Methods

Maximum Powers of mCPC and ordinary PCA over extensive scenarios

For fixed m, the power of mCPC (k) mainly depends on k, μ, and Vm×m. We set different mean vectors under the alternative hypothesis for different covariance matrices Vm×m. We consider two types of Vm×m: in the first, the m variables are uniformly correlated, so that Vm×m is a symmetric positive-definite matrix with diagonal elements 1 and off-diagonal elements ρ; in the second, the correlations among the m variables decrease with the “physical” distance (SNP location), so that Vm×m = (ρ^|i−j|)m×m. Without loss of generality, ρ is chosen to be 0.8 for strong linkage disequilibrium and 0.2 for weak linkage disequilibrium, and we consider m = 20. Note that a test based on ordinary PCA can also be formed, denoted by oPC (k) = Z1 + ⋯ + Zk, which follows a central Chi-squared distribution with k degrees of freedom under H0. Obviously, the power of oPC (k) is also affected by k when μ and Vm×m are given. To view the powers of mCPC (k) and oPC (k) comprehensively, we set α to 0.05 and calculate the powers of mCPC (k) and oPC (k) by numerical integration in R under Scenarios S1 to S16. All parameter settings for Scenarios S1 to S16 are displayed in Table 3. We calculate the eigenvalues of Vm×m, ci, and Ωi, i = 1, …, 20, for all of Scenarios S1 to S16, and display the results in Tables 4 and 5. The power results of mCPC and ordinary PCA are displayed in Figs 1, 2, 3, 4. Under the same correlation structure and mean vector, the powers of mCPC (k) and oPC (k) are strongly affected by the selection of k. From Tables 3, 4, 5 and Figs 1, 2, 3, 4, we find that the maximum powers of both mCPC (k) and oPC (k) are related to the maximum non-centrality parameter over all PCs as k ranges from 1 to 20. For example, in Scenario S15, the non-centrality parameter of the second PC is the largest among all PCs, and mCPC (2) has the maximum power. In another example, under Scenario S2, the non-centrality parameter of the third PC is the largest, mCPC (4) has the maximum power, and the powers of mCPC (3) and mCPC (4) are close. We also find that the ordinary way of selecting k to construct mCPC (k) is not always desirable. For example, in Scenario S12, k0.8 = 5, but oPC (5) has power as low as 0.115. Figures 1, 2, 3, 4 verify that mCPC is more desirable than ordinary PCA and that selecting k according to the cumulative contribution rate is not robust: one can follow the common practice and choose c = 80% or 90% for oPC (kc), but this results in substantial loss of power in some situations. Furthermore, mCPC (kc) performs more robustly than oPC (kc), since mCPC (kc) has reasonable power over all the considered scenarios. For example, in Scenario S2, where the 20 variables are uniformly correlated with ρ = 0.8, the power of oPC (k0.8) is 0.073, far less than the power of mCPC (k0.8), which is 0.57.
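The contrast between oPC (k) and mCPC (k) can be reproduced in a toy simulation (a sketch under assumed non-centrality values, not the paper's Scenarios; here all signal is placed in the 5th PC, which oPC (2) misses but mCPC (5) pools in):

```python
import numpy as np
from scipy.stats import chi2, ncx2

def opc(Z, k):
    """oPC(k) = sum of the first k per-PC statistics; chi2(k) under H0."""
    s = float(np.asarray(Z)[:k].sum())
    return s, chi2.sf(s, df=k)

def mcpc(Z, k):
    """mCPC(k): Fisher combination of the two pooled tail probabilities."""
    Z = np.asarray(Z)
    m = len(Z)
    stat = -2.0 * (np.log(chi2.sf(Z[:k].sum(), k))
                   + np.log(chi2.sf(Z[k:].sum(), m - k)))
    return stat, chi2.sf(stat, df=4)

# toy alternative: m = 10, only the 5th PC carries signal (nc = 12)
rng = np.random.default_rng(3)
def draw():
    Z = rng.chisquare(1, size=10)
    Z[4] = ncx2.rvs(1, 12.0, random_state=rng)
    return Z

reps = [draw() for _ in range(2000)]
power_opc2 = np.mean([opc(Z, 2)[1] < 0.05 for Z in reps])    # misses the signal
power_mcpc5 = np.mean([mcpc(Z, 5)[1] < 0.05 for Z in reps])  # captures it
```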

Table 3 Parameter settings about means and covariance matrices.
Table 4 Eigenvalues, cumulative contribution rates and non-centrality parameters for Scenarios (S1) to (S8).
Table 5 Eigenvalues, cumulative contribution rates and non-centrality parameters for Scenarios (S9) to (S16).
Figure 1

Powers of mCPC (k) and oPC (k) under significance level α = 0.05 for Scenarios (S1) to (S4).

Figure 2

Powers of mCPC (k) and oPC (k) under significance level α = 0.05 for Scenarios (S5) to (S8).

Figure 3

Powers of mCPC (k) and oPC (k) under significance level α = 0.05 for Scenarios (S9) to (S12).

Figure 4

Powers of mCPC (k) and oPC (k) under significance level α = 0.05 for Scenarios (S13) to (S16).

A novel robust strategy to combine PCs

A further investigation of the maximum powers of mCPC (k) and oPC (k) shows that both are related to the non-centrality parameters of the Chi-squared distributions under the alternative hypothesis. For example, for Scenario S16 in Table 5, the non-centrality parameters of the 20 PCs are 0.01, 0.05, 0.11, 0.19, 0.27, 0.37, 0.45, 0.53, 0.58, 0.61, 0.62, 0.60, 0.55, 0.48, 0.38, 0.30, 0.21, 0.12, 0.06, and 0.01, respectively; the largest, 0.62, belongs to the 11th PC. mCPC (10) attains the maximum power of 0.248, and the power of mCPC (11) is 0.247, which is very close; the difference may be caused by numerical error. The cumulative contribution rate of the top 10 PCs is 62.25%, much less than 80%. It is worth noting that the non-centrality parameters are determined by the means and the covariance matrix, which are hard to know in practice. If some prior information on the means and the covariance matrix were available, the optimal selection of k would become easier to obtain. Aschard et al.16 proposed to use mCPC (k) with k determined by a cumulative contribution rate of 80%.

As shown above, using an 80% cumulative contribution rate might not be a robust strategy and can give very low power in some cases (e.g., Fig. 1). Based on the numerical results in Scenarios S1 to S16, we propose the tCPC method to combine all PCs.

Additional Information

How to cite this article: Li, Z. et al. Power Calculation of Multi-step Combined Principal Components with Applications to Genetic Association Studies. Sci. Rep. 6, 26243; doi: 10.1038/srep26243 (2016).