A Nonparametric Regression Approach to Control for Population Stratification in Rare Variant Association Studies

Recently, there is increasing interest to detect associations between rare variants and complex traits. Rare variant association studies usually need large sample sizes due to the rarity of the variants, and large sample sizes typically require combining information from different geographic locations within and across countries. Although several statistical methods have been developed to control for population stratification in common variant association studies, these methods are not necessarily controlling for population stratification in rare variant association studies. Thus, new statistical methods that can control for population stratification in rare variant association studies are needed. In this article, we propose a principal component based nonparametric regression (PC-nonp) approach to control for population stratification in rare variant association studies. Our simulations show that the proposed PC-nonp can control for population stratification well in all scenarios, while existing methods cannot control for population stratification at least in some scenarios. Simulations also show that PC-nonp’s robustness to population stratification will not reduce power. Furthermore, we illustrate our proposed method by using whole genome sequencing data from genetic analysis workshop 18 (GAW18).

the inflation factor λ can be estimated using genotypes at genomic markers. PC-linear approach summarizes the genetic background or ancestry information through the PCs of genotypes at genomic markers. The PCs can be further used to eliminate the effect resulting from population stratification through linear regressions. MLM approach corrects for a wide range of sample structures by explicitly accounting for pairwise relatedness between individuals.
Although several methods for controlling for population stratification have been developed for common variants, it remains unclear whether these methods are equally effective for rare variants. Because rare variants have typically arisen recently, they tend to show greater geographic clustering or more latent subpopulations than common variants that are typically older. The more geographic clusters or latent subpopulations, the more difficult it will be to control for population stratification. Mathieson and McVean 23 demonstrated that rare variants can show a stratification that is systematically different from common variants. They also demonstrated that the commonly used methods such as GC, PC-linear, and MLM to control for population stratification in common variant associations are not necessarily controlling for population stratification in rare variant associations. Zhang et al. 24 showed that the use of PCs calculated from common variants were effective to control for population stratification in rare variant associations. Jiang et al. 25 also found that the PC based methods performed quite well while GC often yielded lower power. Note that both studies of Zhang et al. 24 and Jiang et al. 25 did not explicitly model the spatial structure of populations in their simulation studies. Zhang et al. 24 used two continental groups from the 1000 Genomes Project with six and four subpopulation groups, respectively. Jiang et al. 25 simulated data with two populations. Lissgarten et al. 26 reported that FaST-LMM Select (a MLM approach) could control for population stratification when samples were from spatially structured populations. However, their approach reduced power substantially when causal rare variants are spatially clustered 26,27 .
In this article, we propose a PC based nonparametric regression (PC-nonp) approach to control for population stratification in rare variant association studies. PC-nonp adjusts population effects of both trait values and genotypes at candidate loci for PCs of genotypes at genomic markers by applying nonparametric regressions. We use extensive simulation studies to evaluate the performance of the proposed method PC-nonp and compare the performance of PC-nonp with that of GC and PC-linear developed for common variants and recently proposed biased urn permutation test (BiasePerm) 28 developed for rare variants. Simulation results show that PC-nonp can control for population stratification well in all scenarios while GC, PC-linear, and BiasedPerm cannot control for population stratification at least in some scenarios. Results also show that PC-nonp's robustness to population stratification will not reduce power. Furthermore, we evaluate the performance of our approach by applying it to the whole genome sequencing data from genetic analysis workshop 18 (GAW18) and find that only PC-nonp is effective to control for population stratification.

Method
Consider a sample of n unrelated individuals. Suppose that each individual has been genotyped at a candidate locus (single variant or multiple variants) and at L genomic markers. Let y i , x i , and p i denote the trait value, genotypic score at the candidate locus (weighted sum of genotypic scores if there are multiple variants), and the first k PCs (rescaled to the interval [0, 1]) of genotypes at genomic markers of the i th individual. The PCs of genotypes at genomic markers are good summary measures of ancestry or genetic background. PC-linear is probably the most popular method to control for population stratification. However, this method is based on linear combinations of PCs. Furthermore, recently developed BiasePerm 28 is based on linear combinations of PCs on logistic scale if we use PCs as covariate vector 28 . The relationships between trait values and PCs can be highly nonlinear and population effects cannot be corrected by simply using linear functions 23 . Figure 1 shows the relationships between trait values and the first two PCs of genotypes at 10,000 genomic markers in two structured populations. This figure shows that the relationships between trait values and PCs are highly nonlinear and the forms of the relationships are different in different populations. When the relationships are highly nonlinear and the forms of relationships are unknown, we should use more flexible regression methods rather than use linear regression. Nonparametric regression is a very flexible regression method and it does not require the form of regression function.
In this article, we propose a PC based nonparametric regression (PC-nonp) approach that adjusts population effects of both trait values and genotypes at candidate loci for PCs of genotypes at genomic markers by applying nonparametric regressions. That is, where μ 1 (·) and μ 2 (·) are regression functions with unknown forms and will be estimated using smoothing techniques. Let ⁎ y i and ⁎ x i be the residuals of the nonparametric regressions. We can consider ⁎ y i and ⁎ x i as the trait value and genotypic score at the candidate locus of the i th individual after adjusting for population effects. We can construct association tests based on the residuals.
Many methods have been developed to estimate the unknown regression function, including local linear method [29][30][31] , kernel smoothing method 32,33 and wavelet method 34,35 . We propose to use kernel smoothing method. Let K(·) be a kernel function with mode at 0. The kernel estimators of μ 1 (p i ) and μ 2 (p i ) are given by are the weighted mean of trait values and weighted mean of genotypic scores of those individuals whose genetic background is similar to that of the i th individual. Thus, we can consider residuals = −⁎ y y y i i i and = −⁎ x x x i i i as the trait value and genotypic score of the i th individual after adjusting for population stratification.
In this study, we use the quartic kernel 33 , For computational consideration, we assume that h 1 = ... = h k = h. Then, To test association between trait values and genotypes based on ⁎ y i and ⁎ x i , we can use score test with test statistic The statistic T score asymptotically follows a chi-square distribution with one degree of freedom (df) 36 . For rare variants, x i can be a weighted combination 2 or collapsing 1,3 of genotypes at multiple variants in a genomic region. Based on the residuals of the nonparametric regression, we can construct other rare variant association tests such as CMC 1 , SKAT 9 , and TOW 8 . We will discuss this issue in more details later in the discussion section. In this study, we use a single-variant test in which x i is the genotypic score of a single variant and a regional test in which x i is the weighted combination of genotypes at the variants in a genomic region 2 to evaluate the performance of our proposed method. We have so far assumed a given smoothing parameter in the kernel estimates. It is well known that choosing a proper value for smoothing parameter h is critical to kernel estimates of regression functions 32,37 . We use a method similar to that of Zhang et al. 35 to choose smoothing parameter h. This method is based on the genotypes at a set of genomic markers. Suppose there are L genomic markers. We perform PC-nonp single-variant test for all the L genomic markers and denote P 1 , … , P L as the associated P-values. If population stratification is well controlled for, P-values P 1 , … , P L should follow a uniform distribution under the null hypothesis of no association. Let F n be the empirical distribution function of the P-values P 1 , … , P L and F be the uniform distribution function.
measures how close the distribution of the P-values P 1 , … , P L and the uniform distribution are. We propose to choose h * that minimizes the Kolmogorov test statistic, i.e., h as the value of the smoothing parameter. h * can be obtained by a simple grid search across a range of h. We divide The computational time to find h* increases linearly with S. However, h* needs to be calculated only once. We can use this h* to calculate the residuals of the nonparametric regression for trait values and genotypes at each variant. Let k denote the number of PCs used. In this study, we use h s = 2 2(s−23)/(5+k) , where s = 1, … , 30 and k = 10. It is worth noting that the smoothing parameter h is chosen with the P-values of a single-variant test, whichever test is actually used in testing associations.
Software. R code for implementing our proposed method is given at Shuanglin Zhang's homepage http:// www.math.mtu.edu/~shuzhang/software.html. The R code includes three functions: PCA, choose_OPT_SMP, and Resid_Nonp. PCA gives the first k principal components of genotypes at genomic markers. choose_OPT_ SMP chooses the optimal value of smoothing parameter. Given the value of the smoothing parameter, Resid_ Nonp calculates the residuals of trait values and genotypes at a candidate region by applying nonparametric regression for PCs of genotypes at genomic markers.

Comparison of Tests.
We compare the performance of the proposed test with that of the following four tests. (1) Uncorrected: this test is also based on the score test statistic T score U . T score U is the same as T score but T score U is based on the original trait values y i and genotypic scores x i instead of based on the residuals. (2) GC 17 : GC divides T score U by an inflation factor λ and λ = ... 20 : this test is the same as PC-nonp but PC-linear is based on the residuals of linear regression instead of based on the residuals of nonparametric regression. (4) The biased urn permutation test (BiasedPerm) 28 : in this permutation procedure, the odds of a subject being selected as a case are equal to his or her odds of disease conditional on confounder variables. In this study, PC-linear, PC-nonp, and BiasedPerm are based on the first 10 PCs of genotypes at the genomic markers.
Simulations. We consider two sets of simulations: populations with k 0 subpopulations and populations with spatially structured populations. In each set of simulations, we consider both qualitative and quantitative traits. To generate a qualitative disease affection status, we use a liability threshold model based on a continuous phenotype (quantitative trait). An individual is defined to be affected if the individual's phenotype is at least one standard deviation larger than the phenotypic mean. This yields a prevalence of 16% for the simulated disease in the general population. In the following, we describe how to generate genotypes and how to generate a quantitative trait in the two sets of simulations.
Simulation Set 1: Populations with k 0 Subpopulations. This set of simulations is based on allele frequencies at 24,487 variants calculated from the empirical Mini-Exome genotype data provided by the genetic analysis workshop 17 (GAW17). The genotypes of GAW17 data set are extracted from the sequence alignment files provided by the 1000 Genomes Project for their pilot3 study (http://www.1000genomes.org). GAW17 data contain genotypes of 697 unrelated individuals at 24,487 variants. The distributions of MAF at rare variants (MAF < 0.01) and MAF at common variants of 24,487 variants are given in Figure S1.
To generate genotypes of individuals in a population with k 0 subpopulations, we follow Price et al. 20 , Ionita-Laza et al. 38 , and Qin et al. 39 . For each variant, we randomly select a variant from 24,487 variants and take the MAF at this variant as the ancestral population allele frequency p. Then, independently draw k 0 values ... otherwise. The MAF distributions at the rare variants (MAF < 0.01) and at the common variants for k 0 = 5 are given in Figure S1.
To evaluate type I error, we generate trait values independent of genotypes by using the model: To evaluate power, we consider n T variants (possibly both rare and common variants) in a genomic region. We randomly choose n c from the n T variants as causal variants (in this study, n c = n T /2). For the j th individual in the i th subpopulation, let x ijl denote the genotypic score of the j th individual in the i th subpopulation at the l th causal variant. We assume that all the n c causal variants have the same heritability such that rarer variants have larger effects. Under this assumption, the disease model is given by To generate quantitative traits under null hypothesis, let φ: |1, n| → |1, K 0 | × |1, K 0 | be a function that maps each individual to the grid square from which they originated. Then, we generate the trait value of the i th individual by (l, j) if the i th individual originates from grid square l, j; R l,j is the nongenetic risk in grid square l, j; ε i is a standard normal random number; and β is a constant. We use the following three models to determine the value of R l,j . Model 0: no population stratification in which R l,j = 0 for all l and j. Model 1: a small and sharp spatial distribution in which R l,j = 1 if l 0 ≤ l ≤ l 0 + 3 and j 0 ≤ j ≤ j 0 + 3 for l 0 = j 0 = 6, or 20 − l 0 = j 0 = 6, or l 0 = j 0 = 14; R l,j = 0 otherwise. Model 2: a wide and smooth spatial distribution in which Under alternative hypothesis, we assume that there are n T variants in a genomic region. We randomly choose n c from the n T variants as causal variants. For an individual, let x l denote the genotypic score at the l th causal variant. Under the assumption that all the n c causal variants have the same heritability, the trait value for an individual is generated by  To evaluate type I error rates, we first want to see the performance of the asymptotic distributions we used. For this purpose, we perform simulations under null hypothesis in a homogenous population (k 0 = 1 in simulation set 1) and in the case of no population stratification (model 0 in simulation set 2). Type I error rates are given in Tables 1 and 2 for quantitative traits and qualitative traits, respectively. Table 1 shows that, for quantitative traits, type I error rates of all the four tests in all the scenarios are within the corresponding 95% confidence intervals, which indicates that the asymptotic distributions work very well. Table 2 shows that, for qualitative traits, most of the type I error rates are within the corresponding 95% CIs and those of the type I error rates that are not in the 95% CIs are very close to the corresponding 95% CIs, which indicates that the asymptotic distributions approximately work well.

Existence of the minimum of Kolmogorov test statistic Kol(h).
Type I error rates under structured populations in simulation set 1 for k 0 = 2, 10, 20 are given in Tables 3 and 4 for quantitative traits and qualitative traits, respectively. As shown by these two tables, Uncorrected has inflated type I error rates in all the scenarios. GC cannot control for population stratification for quantitative traits when k 0 = 10 and 20 because most variants have very small correlation with the trait. PC-linear and BiasedPerm cannot control for population stratification when k 0 = 20 because the linear combinations of the first 10 PCs cannot discriminate 20 subpopulations. Only PC-nonp can control for population stratification in all simulation scenarios. If we increase the number of PCs, PC-linear and BiasedPerm may control for population stratification when k 0 = 20. The problems to use PC-linear and BiasedPerm to control for population stratification are (1) we do not know how many PCs should be used and (2) increasing the number of PCs may decrease the power.
Type I error rates under spatially structured populations in simulation set 2 for models 1 and 2 are given in Tables 5 and 6 for quantitative traits and qualitative traits, respectively. These two tables show that Uncorrected has inflated type I error rates in all the scenarios. GC cannot control for population stratification for single variant test because most variants have very small correlation with the trait. PC-linear and BiasedPerm have inflated type I error rates under model 1 because these two methods try to correct highly nonlinear relationships on the basis of linear functions of relatedness. PC-nonp can control for population stratification well in all simulation scenarios because nonparametric regressions can adapt any function, linear or nonlinear.

Power comparison.
To evaluate if PC-nonp's robustness to population stratification will reduce power, we perform simulation studies to compare power using regional tests under k 0 = 1 and k 0 = 10 in simulation set 1  and under models 0 and 2 in simulation set 2, in which all tests except Uncorrected can control for population stratification well. Power comparisons under k 0 = 1 and k 0 = 10 in simulation set 1 are given in Fig. 3. This figure shows that, when there is no population stratification (a homogenous population), all tests have very similar powers. When there is population stratification (a structured population with 10 subpopulations), PC-nonp and PC-linear are more powerful than Uncorrected and BiasedPerm, and GC has the lowest power. GC loses power because it has a larger inflation factor when there is population stratification. BiasedPerm essentially performs permutation within subpopulations and thus it will lose power when there are a large number of subpopulations. Uncorrected loses power because, in the structured population with 10 subpopulations, different trait value means in subpopulations weaken the association signal. PC-nonp and PC-linear do not lose power because, after adjusted for population effects, it appears that PC-nonp and PC-linear perform association tests in a homogenous population.  Table 2. Type I error rates of four tests in a homogenous population and in the case of no population stratification for qualitative traits. Note: "A homogenous population" means k 0 = 1 in simulation set 1; "no population stratification" means model 0 in simulation set 2.   Table 6. Type I error rates of five tests based on simulation set 2 for qualitative traits. Analysis of GAW18 whole genome sequencing data set. The data set for GAW18 includes whole genome sequencing (WGS) data of 959 individuals (464 directly sequenced and the rest imputed) from 20 Mexican American pedigrees from San Antonio, Texas. There are 21-76 individuals in each pedigree. Phenotype data include sex, age, year of examination, systolic and diastolic blood pressure (SBP and DBP), use of antihypertensive medications, and tobacco smoking at up to four time points.

Number of
Since Mexican American population is admixture population, association studies based on unrelated individuals from this population may be subjected to bias due to population stratification. For our purpose, we extract 132 genetically unrelated individuals from the 20 pedigrees with phenotypes and WGS data and select SBP as the trait of interest while take sex, age, use of antihypertensive medications, and tobacco smoking as covariates. For WGS data, we only consider one chromosome (chromosome 17). Among the 132 unrelated individuals, there are 404,032 SNPs on chromosome 17. Since the sample size is small, we only consider the 41,754 uncommon SNPs with MAF between 0.02 and 0.05 instead of including rare SNPs. We randomly draw 10,000 SNPs from the 41,754 SNPs without replacement and test association between the phenotype and each of the 10,000 SNPs using each of the four tests: Uncorrected, GC, PC-linear, and PC-nonp. We repeat the drawing procedure 4 times with re-drawing 10,000 SNPs from the 41,754 SNPs. Quantile-quantile plots of the observed − log 10 (P-values) of the four tests and expected log 10 (P-value) under the assumption of uniform distribution of P-values are given in Fig. 5. All quantile-quantile plots are averaged over 4 draws in order to show the average effect. Since we randomly draw 10,000 SNPs across chromosome 17, it is unlikely that there are a large number of SNPs in the 10,000 SNPs associated with SBP. Therefore, if population stratification can be well controlled for, P-values should proximately follow a uniform distribution. Figure 5 shows that only P-values of PC-nonp nearly follow a uniform distribution while for all other tests, the number of small P-values is more than expected.

Discussion
With the development of next-generation sequencing technology, there is increasing interest to detect associations between rare variants and complex traits. Many statistical methods have been developed for detecting rare variant associations. However, these methods may be subject to bias due to population stratification and, as pointed out by Mathieson and McVean 23 , existing methods developed to control for stratification are not necessarily effective in rare variant associations. Therefore, statistical methods that can control for population stratification in rare variant association studies are needed. In this article, we propose the PC-nonp approach to control for population stratification in rare variant association studies. To apply PC-nonp, we first calculate PCs of genotypes at the genomic markers. Then, we use these PCs to adjust population effects of both trait values and genotypes at a candidate locus by applying nonparametric regressions. Our simulations show that the proposed PC-nonp can control for population stratification well in all scenarios while existing methods cannot control for population stratification at least in some scenarios. Simulations also show that PC-nonp's robustness to population stratification will not reduce power. Applications to the GAW18 whole genome sequencing data set also show that our proposed method can control for population stratification better than existing methods. Although we describe our proposed method using a single-variant test and a weighted sum regional test, our method can be applied to most existing rare variant association tests such as CMC 1 , SKAT 9 , and TOW 8 . To apply our method to SKAT and TOW, denote y i and x im as the trait value and genotypic score at the m th variant of the i th individual. Let x is denote the residuals of nonparametric regressions y i = μ(p i ) + ε i and x is = μ s (p i ) + ε is , where i = 1, … , n and s = 1, … , S. Based on residuals ⁎ y i and ⁎ x is , we cannot use T 2 test because ⁎ x is are not 0 and 1. We can use a score test or the improved score test 36 .
Zhang et al. 41 proposed a semi-parametric test for association (SPTA) to control for population stratification. SPTA models the relationship between trait values, genotypic scores at the candidate marker, and PCs of genotypes at genomic markers through a semi-parametric model, where the exact form of relationship between trait values and PCs is assumed unknown, but trait values have linear relationship with genotypic scores at the candidate marker. Although SPTA and PC-nonp are equivalent for single-variant tests under quantitative traits, SPTA is difficult to extend to regional rare variant association tests such as SKAT and TOW because it is designed for single-variant tests.