Nonparametric Risk and Nonparametric Odds in Quantitative Genetic Association Studies

The coefficient in a linear regression model is commonly employed to evaluate the genetic effect of a single nucleotide polymorphism associated with a quantitative trait under the assumption that the trait value follows a normal distribution or is appropriately normally distributed after a certain transformation. When this assumption is violated, the distribution-free tests are preferred. In this work, we propose the nonparametric risk (NR) and nonparametric odds (NO), obtain the asymptotic normal distribution of estimated NR and then construct the confidence intervals. We also define the genetic models using NR, construct the test statistic under a given genetic model and a robust test, which are free of the genetic uncertainty. Simulation studies show that the proposed confidence intervals have satisfactory cover probabilities and the proposed test can control the type I error rates and is more powerful than the exiting ones under most of the considered scenarios. Application to gene of PTPN22 and genomic region of 6p21.33 from the Genetic Analysis Workshop 16 for association with the anticyclic citrullinated protein antibody further show their performances.

In binary-trait genetic association studies, the relative risk (RR) is commonly employed to show the degree of risk of the genetic variants associated with human complex diseases. The RR has two important features that it compares the probability of an event occurring in an exposed group to that of the event occurring in a non-exposed group and the reference group ensures the proper comparison. As an alternative, the coefficient in a linear regression model can be reported to show the association strength between the genetic variant and the quantitative trait. When the trait values follow the normal distribution or are normally distributed after a certain transformation such as the Box-Cox transformation, the corresponding statistical analyses are valid. However, the assumption of normality is often violated in practice. For example, for the anticyclic citrullinated protein antibody (anti-CCP) to be analyzed later, neither the observed values nor their logarithm transformation versions satisfy the normal assumption due to truncation. Sometimes, even though some transformations can be used, different transformations might result in different conclusions.
The Kruskal-Wallis test 1 , the Jonckheere-Terpstra test 2,3 , the U-statistics-based tests 4,5 , and the nonparametric trend test 6 can be used to evaluate the association between the genetic variants and the quantitative traits for the non-normal trait values. The distributions or the approximate distributions of the test statistics are derived under the null hypothesis that the genotypes are not associated with the traits. However, there is no genetic effect size defined. Recently, Konietschke et al. 7  size to measure the genetic effect, where X j denotes a random variable taking the phenotype values corresponding to genotype j, j = 0, 1, 2. Based on θ j , they constructed the maximal test considering three genetic models: recessive, additive and dominant models. Also, Brunner and Munzel 8 defined it as relative treatment effect in solving Behrens-fisher problem and Ryu 9 called it ordinal effect size measure for the ordered categorical data. However, they did not study the performances of the estimators for θ j and their confidence intervals, especially under the alternative hypothesis that the genetic variant is associated with the trait, which is of great interest to investigators when a positive finding has been found.
In this paper, we define the nonparametric odds (NO) as where Y i denote the trait value that the subjects taking in the group with genotype i, i = 1, 2. Compared to θ j , j = 0, 1, 2, NO has three salient features. First, it gives a comparison of the difference between two groups; the second is that the reference group with genotype value 0 can guarantee a reasonable comparison; third, it equals 1 under the null hypothesis. This paper is organized as follows. In the "Results" section, simulations and real data analysis are conducted to illustrate the performances of the proposed procedures. Some further topics and issues are present in the "Discussion" section. In the "Methods" section, we describe the NO, give its point and confidence interval estimates, and construct the test statistic for a given genetic model and a robust test, which is free of genetic models. At last some technical details are given in the Supplementary Material.

Results
Simulation Settings. We conduct simulation studies to explore the performances of the proposed procedures. Consider the linear model denotes the genotype value at a single nucleotide polymorphism (SNP) locus with ∈ , , g {0 1 2} i being the count of a certain allele, and ε ε ε ε = ( , , , ) τ  n 1 2 denotes the random error with ϵ i i.i.d. following a truncated generalized extreme value distribution (a heavy-tailed distribution), tGEV(0,0,1,0) with the shape parameter 0, the location parameter 0, the scale parameter 1 and the truncated point 0. We consider β 0 = 0.5, β ∈ , . , .
{0 0 25 0 50} 1 and three minor allele frequencies (MAFs) with 0.15, 0.30 and 0.45. We consider three sample sizes with 500, 1,000 and 1,500, where the results for the sample size of 500 and 1,000 are shown in the Supplemental Material. 2,000 replicates are conducted to calculate the character statistics for each scenario.
Biases, mean squared error and confidence Interval. Table 1 shows the empirical bias, the square root of mean square error (sMSE), the cover probability (CP) and the interval length (IL) for β ∈ . , .
{0 25 0 5} 1 and MAFs ∈ . , . , .    For MAF of 0.05, KLH is too optimistic with the empirical type I error rates 0.174, which is by far larger than 0.05. We then consider a stringent nominal level of 5 × 10 −4 . Similar results are observed, see Table  S19 for details in the Supplementary Material. For the F, Jonckheere-Terpstra and KLH procedures, We use the existing R-software package to calculate the corresponding p-values. Specifically, the function lm in the package stats to compute the statistical significance of the F test, the function jonckheere.test in the package clinfun to compute that of the Jonckheere-Terpstra test, and the function nparcomp in the package nparcomp to compute that of the KLH test.

Power Comparison.
We conduct simulation studies to compare the power among the above tests. We use the same simulation settings as above and set β 0 = 0.50, β ∈ , . , .  Applications to the gene PTPN22 and the genomic region of 6p21.33. It is well known that the genetic variants are deleterious to rheumatoid arthritis (RA) 10 . The anti-CCP taking continuous values, are much more common in the blood of individuals with RA than those without it 11 . The specificity using the anti-CPP to diagnose RA lies between 87.8% and 96.4% 12 . The gene PTPN22 and genomic region of 6p21.33 were reported to be associated with RA 13,14 . We apply the proposed procedures to the gene PTPN22 data including 25 SNPs and the genomic region of 6p21.33 including 45 SNPs from the Genetic Analysis Workshop 16 15 . The data consists of 868 cases and 1,194 controls, where 868 subjects have the anti-CCP measure and 1,194 do not have them. The minimum value of anti-CCP for these 868 individuals is 20.053. As shown in Li et al. 6 , neither the original values of anti-CCP (p-value using Shapiro-Wilk test is 10 −16 ) nor their logarithm transformation ones (p-value is 3.9 × 10 −9 ) follow the normal distribution. It is most likely that the observed anti-CCP measurements come from a truncated distribution since the anti-CCP is often measured to confirm RA and so it is not measured in controls. So the statistical analysis under the normal assumption is not reasonable.
Here we only use the genotype data in cases since the anti-CCP values are missing in controls. Table  S20 in the Supplemental Material shows the p-values of 25 SNPs in gene PTPN22 and Table S21 shows the p-values of 45 SNPs in the genomic region of 6p21.22 using the KW-A, F-A, Z A , and MAX3. To evaluate the accuracy of the derived formulas, we also include 100,000 permutations to calculate the p-value of MAX3 in the table. The p-value results using permutation is very close to that derived from the asymptotic distribution. We find that the minimum p-value for 25 SNPs is 0.018 using MAX3. Under the significance level of 0.05, and after the Bonferroni correction, no significance reveals. The reason might be that we do not use the data whose anti-CCP measures are less than 20.053. For the genomic region of 6p21.22, there is one SNP rs2734583, which is significant under the significance level of 0.05, and the Bonferroni correction. The p-values using KW-A, F-A, Z A and MAX, are 0.0004, 0.0009, 0.0014, and 0.0008, respectively.    In this work, we define the NR and NO, and give their point and confidence interval estimates under the general framework. Then we define the recessive, additive and dominant genetic models using NO and propose the corresponding point and confidence interval estimates. We also provide three test statistics under different genetic models and a robust test being free of genetic model. Extensive simulation studies shows that the cover probabilities of the proposed procedures are satisfactory, because they are close to the nominal level, and the proposed test is powerful than the exiting procedures under a given genetic model. Applications to GAW16 data for anti-CCP further show the performances of the proposed procedures.

Figure 1. The empirical power of the Kruskal-Wallis test (KW-R, KW-A and KW-D), the Jonckheere-Terpstra test (JT-R, JT-A and JT-D), the F test (F-R, F-A and F-D) and the proposed nonparametric test (Z R , Z A and Z D ) derived under a given genetic model.
The first column is for β 1 = 0.25 and the second column is for β 1 = 0.5.
There are several issues that can be further investigated. First in this work we study only the scenario for the common allele under the hypothesis of common disease common allele. Under the hypothesis of common disease rare allele, how to extend the proposed procedure to this situation is worth considering, where the large sample theory does not hold. Secondly, we focus on the biallelic marker, the proposed procedures can be readily extended to multiallelic locus, and we can similarly define the NR and NO and construct the confidence intervals of NO. Thirdly, it is well known that family-based association studies is less susceptible to the confounding factors than population-based studies, so a trio design with two parents and an offspring could be considered in the future.
Confounding factors can not be ignored and adjusting for covariates is an important issue in population-based genetic association studies. For example, The hidden population structure can lead to many false-positive findings and correcting for population stratification has become a routine in genome-wide association studies (GWASs) 16,17 . To remove the effect of the covariates, we can use the same strategies as EIGENSTRAT 16 . We first compute the residuals of the trait after regressing out the covariates and then take the residuals as the new outcome rather than the original trait values. Simulation results shown in the Supplemental Material are consistent with those without considering the covariates.
In the proposed NR and NO, we do not consider the ties among the observations because the probability of the event that there are ties in the observations for the phenotype taking continuous values is zero. However, if there are ties among the sample, we can add to the formulas of NR and NO, and 0.5 × (number of ties between group with genotype i 1 and group with genotype i 2 )/( ) n n i i 1 2 to the estimators. To calculate λ 1A under the additive model, the linear combination of λ 1 and λ 12 is used. Here, we adopt the proportion of sample size for each NO estimate over the total sample sizes as the weights. There are also some other available weights. For example, the MSE-based weights proposed by Zhong and Prentice 18 to reduce the bias of log-odd ratios in a two-stage GWAS, and the square root of the sample proportion given by 19 to jointly analyze the test statistics in a two-stage GWAS. So a comparison of the efficiency of these weights is a valued topic.
In the Supplementary Material, we prove that both , = , f i 1 2 i and ξ follow the standard normal distribution as min{n 0 ,n 1 ,n 2 } goes to infinity. However, how many samples we need to get an accurate estimation? Based on the simulations, 1,000 more sample size is needed to get a reasonable estimate. In the current GWASs, the sample size is tens of thousands. For example, the Wellcome Trust Case Control Consortium (2007) 20 used about 2,000 cases for each of seven diseases and 3,000 shared controls to detect the deleterious SNPs. Another example was Landi et al.'s (2009) 21 Lung GWAS, where 5,939 cases and 5,848 controls were genotyped to identify the hereditary contribution to adenocarcinoma.
In the simulations, we consider the situation where the MAF is greater than 0.1, the sample size is larger than 1,000 and the distribution is truncated generalized extreme value distribution. Actually, we also conduct the simulation studies considering that MAF is 0.05, the sample size is 1000 and 500, and the distribution is normal distribution and centralized t distribution. Results given in the Supplemental Material indicate that the proposed test is more powerful than the existing ones under most of the considered scenarios.
The proposed NR and NO can estimate the non-parametric disease risk and odds in studies with "prospective sampling" such as a cohort study. However, in case-control studies, the case/control ratio for recruited subjects is different from the true one in the population. The derived tests can still be used to detect the association between the quantitative trait and genetic variants in case-control association studies. Finally, the proposed procedure has been coded in R verion 3.1.1 and is freely requested from the corresponding author.

NR and NO.
Suppose that n subjects are enrolled from a source population in a quantitative trait genetic association study. A biallelic SNP is considered here. Assume that the genotype is coded as 0, 1, and 2, corresponding to the count of a certain allele. Let ( , ), = , , ,  y g i n 1 2 i i be the observed sample, where y i and g i denote the trait value and genotype value of the ith subject. For the convenience of notations, let the first n 0 subjects have the genotype 0, the second n 1 subjects have the genotype 1, and the last n 2 subjects possess the genotype 2. Suppose that the group with genotype 0 is the reference group. , respectively. Here we do not consider that there exit ties in the sample because the probability of the ties between two genotype groups is zero when the random variables Y 0 ,Y 1 and Y 2 are continuous. f i shows the probability that the trait values in a group with genotype i is stochastically larger than those in the reference group with gentoype 0, i = 1,2. We call f 1 and f 2 the nonparametric risk (NR). Then we define the NO as λ i = f i /(1 − f i ), i = 1, 2. We point out that when any of the f 1 or f 2 is one, the estimator for λ 1 or λ 2 is not defined, although this is unlikely to occur in practice. The null hypothesis is H 0 : f 1 = f 2 = 0.5 or λ 1 = λ 2 = 1. The alternative hypothesis is H 1 : f 2 ≥ f 1 ≥ 0.5 and f 2 > 0.5 or equivalently, λ 2 ≥ λ 1 ≥ 1 and λ 2 > 1.
The empirical estimators of f 1 and f 2 are, respectively, Besides the Standard Interval, we can also construct the Wilson Interval 22,23 . f i can be thought to estimate the success probability of a binomial distribution with two parameters (N i , f i ) where N i represent the number of trials and f i is the success probability. We estimate the efficient N i by setting Then the number of successes is