Joint Analysis of Multiple Phenotypes in Association Studies based on Cross-Validation Prediction Error

In genome-wide association studies (GWAS), joint analysis of multiple phenotypes could have increased statistical power over analyzing each phenotype individually to identify genetic variants that are associated with complex diseases. With this motivation, several statistical methods that jointly analyze multiple phenotypes have been developed, such as O’Brien’s method, Trait-based Association Test that uses Extended Simes procedure (TATES), multivariate analysis of variance (MANOVA), and joint model of multiple phenotypes (MultiPhen). However, the performance of these methods under a wide range of scenarios is not consistent: one test may be powerful in some situations, but not in the others. Thus, one challenge in joint analysis of multiple phenotypes is to construct a test that could maintain good performance across different scenarios. In this article, we develop a novel statistical method to test associations between a genetic variant and Multiple Phenotypes based on cross-validation Prediction Error (MultP-PE). Extensive simulations are conducted to evaluate the type I error rates and to compare the power performance of MultP-PE with various existing methods. The simulation studies show that MultP-PE controls type I error rates very well and has consistently higher power than the tests we compared in all simulation scenarios. We conclude with the recommendation for the use of MultP-PE for its good performance in association studies with multiple phenotypes.


Method
We consider a sample with n unrelated individuals. Each individual has K (potentially correlated) phenotypes and has been genotyped at a variant of interest. Let y ik denote the k th phenotype value of the i th individual and x i denote the genotype score of the i th individual, where x i ∈{0, 1, 2} is the number of minor alleles that the i th individual carries. We model the relationship between the multiple phenotypes and the genetic variant using an inverse linear regression model, in which the genotype at the variant of interest is the response variable and the multiple phenotypes are predictors. That is, We are not the first using an ordinal variable as response variable in a linear model. To correct for population stratification, Price et al. 16 used a qualitative phenotype or genotypes as response variables in linear models. To adjust the effects of covariates in rare variant association studies, Sha et al. 17 also used a qualitative phenotype or genotypes as response variables in linear models. To test the association between the K multiple phenotypes and the variant, we test the null hypothesis H 0 :β 1 = ··· = β K = 0 under model (1).
Let y i = (1,y i1 , …, y iK ) T and β = (β 0 , β 1 , …, β K ) T , then the regression model in equation (1)  , where Y = (y 1 , …, y n ) T and x = (x 1 , …, x n ) T . When multiple phenotypes are highly correlated, the rank of matrix Y may be less than K, then the inverse of Y T Y may not exist, which results in that the ordinary linear square estimate of β may not be unique 18 . Since multiple phenotypes in a GWAS are usually highly correlated, we propose to use Ridge regression [19][20][21][22][23][24] . Ridge regression penalizes the size of the regression coefficients. The Ridge regression estimator of β is defined as the value of β that minimizes where λ (λ ≥ 0) is a tuning parameter. The solution to the Ridge regression is given by Here the estimator of β depends on λ and we use the subscript λ to indicate that the estimator of β is a function of λ.
Based on Ridge regression, we propose to use the leave-one-out cross validation (LOOCV) prediction error under model (1) as a test statistic. Let λ − x i denote the LOOCV predicted value (leave the i th individual out) of x i under model (1) with parameter λ in Ridge regression. Then, the statistic can be written Note that T λ is the LOOCV prediction error, thus low values of T λ would imply significance. Let p λ denote the p-value of T λ (see next paragraph on how to calculate p λ ). We define the test statistic of Multiple Phenotypes based on Prediction Error (MultP-PE) as = .
λ λ − T p min (2) MultP PE We propose to use a grid search method in equation (2) to evaluate the minimization. We divide the interval To apply MultP-PE to GWAS with hundreds of thousands of SNPs, we also propose an algorithm that can perform the permutation procedure described above more efficiently in the following section.
A Fast Algorithm for the permutation procedure. We use the notations in the above section and let . We can show that the LOOCV prediction error in Ridge regression has a closed-form formula 24,26 Note that for two matrices or vectors A and B, we use A*B and A B to denote the element-wise operations; for a matrix C, we use colSum(C) to denote the sums of the columns of matrix C. We assume n ≥ K + 1. We perform singular value decomposition of Y, that is, Y = UDV, where U is an n × (K + 1) matrix with orthonormal columns, D is (K + 1) × (K + 1) diagonal matrix with non-negative real numbers on the diagonal, and V is an . Note that C, U, and H only depend on phenotypes and λ 1 , …, λ M . Thus, C, U, and H do not change in each permutation. For a GWAS, C, U, and H also do not change at different SNPs. For 1,000 permutations on one SNP, our fast algorithm is about 20 times faster than the original algorithm (the original algorithm calculates To perform a GWAS with hundreds of thousands of SNPs, we can use the same approach as was suggested in Zhu et al. 14 , that is, we can first select SNPs that show evidence of association based on a small number of permutations (e.g. 1,000), then use a large number of permutations to test the selected SNPs. For example, in our real data analysis with 630,860 SNPs, we first performed 1,000 Although we use a permutation procedure to calculate the p-value of MultP-PE, by using our fast algorithm, we can use less than one day to perform a typical GWAS. In our read data analysis on COPD in the following section, we performed a GWAS with 5,430 individuals across 630,860 SNPs and seven phenotypes. We completed the analysis in 10 hours on Intel Xeon E5-2680v3 by using a single node.
In the above section, we describe MulP-PE without considering covariates. If covariates are needed to be considered, we can incorporate covariates using the following approach in MultP-PE. Suppose that there are total G covariates we would like to consider and let (z i1 , …, z iG ) T denote the covariates for the i th individual. We can adjust esch of the phenotypes by the covariates by applying the linear regression model y a a z a z and use the residual of y ik to replace y ik in MultP-PE. In our real data analysis, we used this approach to incorporate four covariates. This approach has been used in the literature. For example, Sha et al. 16 and Zhu et al. 14 also used the same approach to adjust phenotypes for the covariates.
In association studies for unrelated individuals, it has been well known that population stratification can seriously confound association results 27 . There are several methods that have been developed to control for population stratification. For example, Genomic Control (GC) approach 28,29 , Principal Component (PC) approach 16,[30][31][32] , and Mixed Linear Model (MLM) approach 33,34 . Similar to most association tests for unrelated individuals, MulP-PE subjects to bias due to population stratification. To make MultP-PE robust to population stratification, we can use the PC approach. Let c i1 , …, c iL denote the top L PCs of the genotypes at a set of genomic markers for the i th individual. We can use the residuals of the regression model  In simulation studies, we evaluate type I error rates of MultP-PE by generating data sets with three different sample sizes, 500, 1,000 and 2,000. For power comparison, we compare the powers of different methods by simulation data sets with 1,000 unrelated individuals. For genotype data, we generate genotype at a genetic variant according to minor allele frequency (MAF) and assume Hardy-Weinberg Equilibrium (HWE). For each individual, we generate K phenotypes using models similar to the models used in Zhu et al. 14 and Wang et al. 35 . The K phenotypes are generated from the following model where y = (y 1 , …, y K ) T ; φ = (φ 1 , …, φ K ) are the genetic effects of the variant on the K phenotypes; x is the genotypic score at the variant; c is a constant number; γ is a K × R matrix; ω = (ω 1 , …, ω R ) T is a vector of factors with R elements and ω ω ω , ρ is the correlation between factors, A is a matrix with elements of 1, and I is the identity matrix; ε = (ε 1 , …, ε K ) T is a vector of residuals, ε 1 , …, ε K are independent, and ε ∼ N(0, 1) k for k = 1, …, K. Based on equation (4), we consider the following four models in which the within-factor correlation is c 2 and the between-factor correlation is ρc 2 .

Model 1.
There is only one factor and genotypes impact on all phenotypes with different effect sizes. That is,   , and γ = Bdiag(D 1 , . For the type I error rates, we set β = 0 to indicate that the genetic variant has no effect on all phenotypes. For power comparisons, we consider different values of β. To evaluate type I error rate and power, we set MAF = 0.3, the between-factor correlation is 0.14, and the within-factor correlation is 0.25. In the following simulation studies and real data analysis, we use eight different values of λ (M = 8) and set λ = .

Results
To evaluate the type I error rates of MultP-PE, we consider different significance levels (0.01 and 0.05), different sample sizes (500, 1000 and 2000), and different number of phenotypes (10, 20 and 40). We use 1,000 permutations to calculate the p-values of MultP-PE and use 10,000 replicated samples to evaluate type I error rates of MultP-PE. For 10,000 replicated samples, the 95% confidence intervals (CIs) for the estimated type I error rates with nominal levels 0.05 and 0.01 are (0.04562, 0.05438) and (0.00804, 0.01196), respectively. We summarize the estimated type I error rates of the proposed test in Table 1. This table shows that only one type I error rate is not in the corresponding 95% CI (it is very close to the upper-bound of the CI), which indicates that the proposed method is valid.
In power comparisons, we calculate the p-values of MultP-PE using 1,000 permutations and the p-values of MultiPhen, OW, TATES, MANOVA, OB using their asymptotic distributions. We evaluate the powers of all of the six tests using 1,000 replicated samples at a significance level of 0.05. Figures 1 and 2 show the powers of the six methods as a function of the effect size β with K = 20 and 40, respectively. As shown in these two figures: (1) MultP-PE is the most powerful test. The power of MultP-PE is much higher than the second most powerful test; (2) as the effect size β increases, the powers of all tests increase as well; as the number of phenotypes K increases from 20 to 40, MultP-PE presents more ascendancy than the other five tests; (3) MultiPhen, OW, and MANOVA have similar powers under all four models. A similar conclusion has been reached in some published papers 2,6,7 ; (4) OB is comparable to MultiPhen, OW, and MANOVA in models 1 and 2, but has almost no power when the genetic effects have different directions (models 3 and 4); (5) TATES is more powerful than MultiPhen, OW, and MANOVA in model 2, but is less powerful than MultiPhen, OW, and MANOVA in models 3 and 4.
Power comparisons of the six methods as a function of the within-factor correlation, c 2 , with K = 20 and 40 are given in Figs 3 and 4, respectively. As shown in these two figures: (1) the patterns of the power performance are similar to those in Figs 1 and 2; (2) when the within-factor correlation is increasing, the powers of all six tests  have increasing trend or decreasing trend depending on different model settings. This pattern has been confirmed in Zhu's paper 6 ; (3) OB is the least powerful test except under model 2 with the within-factor correlation > 0.2. Power comparisons of the six methods as a function of the between-factor correlation, c 2 ρ, with K = 20 and 40 are given in Figs S1 and S2, respectively. As shown in these two figures: (1) the patterns of the power performance are similar to those in Figs 1 and 2; (2) when the between-factor correlation is increasing, the powers of all six tests have increasing trend except for these under model 1; (3) MultP-PE is the most powerful test, while OB is the least powerful test except under model 2 with the between-factor correlation = 0.1.
In summary, MultP-PE is consistently the most powerful test among the tests we compared under all simulation scenarios.

Real Data Analysis
Chronic obstructive pulmonary disease (COPD) is a terminology to describe progressive life-threatening lung diseases that causes breathlessness and serious illness, including emphysema, chronic bronchitis, refractory asthma, and some forms of bronchiectasis. A global prevalence of 251 million cases of COPD is reported in 2016 and it is estimated that COPD caused 3.17 million deaths in 2015 36 . The COPDGene aims to find inherited or genetic factors that associated with COPD. The COPDGene dataset includes 10,192 participants, 3,408 of them are African-Americans (AA), and 6,784 of them are Non-Hispanic Whites (NHW). Same as Liang et al. 37 , we considered Age, Sex, BMI, and Pack-Years as four covariates and selected seven quantitative COPD-related phenotypes (FEV1, Emphysema, Emphysema Distribution, Gas Trapping, Airway Wall Area, Exacerbation frequency, and Six-minute walk distance) in the following data analysis.
We deleted individuals and genotypes with missing data. After excluding missing data, a set of 5,430 NHW across 630,860 SNPs was used in the analysis. Then we adjusted the phenotypes for the covariates by applying a linear regression 14,17 . We regressed each phenotype on the four covariates, replaced original phenotypes with the residuals of the regression, and applied each of the six tests to detect the association between the covariates-adjusted phenotypes (residuals) and each SNP.
We used genome-wide significance level 5 × 10 −8 to identify SNPs that are significantly associated with the seven COPD-related phenotypes. There were total 14 SNPs identified by at least one method ( Table 2). All of the 14 SNPs had been reported to be associated with COPD by previous studies [38][39][40][41][42][43][44][45][46][47][48][49][50] . As shown in Table 2, MultiPhen identified 14 SNPs; OW, MANOVA, and MultP-PE identified 13 SNPs; TATES identified 9 SNPs; and OB did not identify any SNPs. The number of SNPs identified by MultP-PE was comparable to the largest number of SNPs identified by other tests and the COPD analysis results were consistent with our simulation results. We also performed individual phenotype analysis on each of the seven phenotypes. Table S1 gives the adjusted p-values (Bonferroni correction for multiple testing) to test each of the seven phenotypes on the 14 significant SNPs. We can see from Table S1, among the 14 SNPs, only nine SNPs are significantly associated with Emphysema Distribution at the genome-wide significance level. The number of SNPs identified by individual phenotype is the same as TATES and is less than the number of SNPs identified by four multiple phenotype analyses (OW, MANOVA, Multiphen, and MultP-PE), which showed that the simultaneous analysis of multiple phenotypes can increase power comparing with single phenotype analysis.

Discussion
For complex diseases in GWAS, the association between a genetic variant and each phenotype is usually weak. Analyzing multiple disease-related phenotypes could increase statistical power to identify the association between genetic variants and complex diseases. In this article, we developed a novel statistical method, MultP-PE, to test the association between a genetic variant and multiple phenotypes based on cross-validation prediction error. We showed that MultP-PE controls type I error rates very well and has consistently higher power than other methods we compared among all the simulation scenarios. Overall, MultP-PE is the most powerful test and has much higher power than the second most powerful test; OW, MANOVA, and MultiPhen have very similar performance; OB loses power dramatically when genetic effects have opposite directions on phenotypes; TATES is more powerful when the genetic effect only works on a portion of phenotypes. In real data analysis, MultP-PE identified 13 out of 14 significant SNPs, which is comparable to MultiPhen (14 out of 14).