Article | Open | Published:

A Nonparametric Regression Approach to Control for Population Stratification in Rare Variant Association Studies

Scientific Reports volume 6, Article number: 37444 (2016) | Download Citation

Abstract

Recently, there is increasing interest to detect associations between rare variants and complex traits. Rare variant association studies usually need large sample sizes due to the rarity of the variants, and large sample sizes typically require combining information from different geographic locations within and across countries. Although several statistical methods have been developed to control for population stratification in common variant association studies, these methods are not necessarily controlling for population stratification in rare variant association studies. Thus, new statistical methods that can control for population stratification in rare variant association studies are needed. In this article, we propose a principal component based nonparametric regression (PC-nonp) approach to control for population stratification in rare variant association studies. Our simulations show that the proposed PC-nonp can control for population stratification well in all scenarios, while existing methods cannot control for population stratification at least in some scenarios. Simulations also show that PC-nonp’s robustness to population stratification will not reduce power. Furthermore, we illustrate our proposed method by using whole genome sequencing data from genetic analysis workshop 18 (GAW18).

Introduction

Recently, there is increasing interest to detect associations between rare variants and complex traits. The variant by variant methods used to detect associations of common variants may not be optimal for detecting associations of rare variants due to allelic heterogeneity as well as the extreme rarity of individual variants1. Many statistical methods for testing the association of rare variants have been developed by using joint information of multiple variants in a genomic region. These methods can be roughly divided into three groups: burden tests, quadratic tests, and combined tests.

Burden tests1,2,3,4,5 collapse rare variants in a genomic region into a single burden variable and then regress the phenotype on the burden variable to test for the cumulative effects of rare variants in the region6. Burden tests implicitly assume that all rare variants are causal and directions of effects are all the same. Quadratic tests include tests with statistics of quadratic form of score vector7,8,9 and also adaptive weighting methods10,11,12,13. Quadratic tests are robust to directions of effects of causal variants and are less affected by neutral variants than burden tests do. If most of the rare variants are causal and directions of effects of causal variants are all the same, burden tests can outperform quadratic tests; otherwise, quadratic tests perform better. Combined tests6,14 combine information from burden tests, quadratic tests, and possibly other tests aiming to have advantages of multiple tests and to increase the robustness of tests.

All the aforementioned methods are population-based methods for unrelated individuals. It has been long recognized that, for population-based association studies, population stratification can seriously confound association results15,16. For rare variants this problem can be more serious, because the spectrum of rare variations can be very different in different populations. In common variant association studies, several methods that use a set of genomic markers genotyped in the same samples have been developed to control for population stratification. These methods include genomic control (GC) approach17,18,19, principal component (PC) based linear regression (PC-linear) approach20, and mixed linear model (MLM) approach21,22 among others. GC approach adjusts the ordinary chi-square test statistic X2 to X2/λ and assumes X2/λ follows a chi-square distribution, where the inflation factor λ can be estimated using genotypes at genomic markers. PC-linear approach summarizes the genetic background or ancestry information through the PCs of genotypes at genomic markers. The PCs can be further used to eliminate the effect resulting from population stratification through linear regressions. MLM approach corrects for a wide range of sample structures by explicitly accounting for pairwise relatedness between individuals.

Although several methods for controlling for population stratification have been developed for common variants, it remains unclear whether these methods are equally effective for rare variants. Because rare variants have typically arisen recently, they tend to show greater geographic clustering or more latent subpopulations than common variants that are typically older. The more geographic clusters or latent subpopulations, the more difficult it will be to control for population stratification. Mathieson and McVean23 demonstrated that rare variants can show a stratification that is systematically different from common variants. They also demonstrated that the commonly used methods such as GC, PC-linear, and MLM to control for population stratification in common variant associations are not necessarily controlling for population stratification in rare variant associations. Zhang et al.24 showed that the use of PCs calculated from common variants were effective to control for population stratification in rare variant associations. Jiang et al.25 also found that the PC based methods performed quite well while GC often yielded lower power. Note that both studies of Zhang et al.24 and Jiang et al.25 did not explicitly model the spatial structure of populations in their simulation studies. Zhang et al.24 used two continental groups from the 1000 Genomes Project with six and four subpopulation groups, respectively. Jiang et al.25 simulated data with two populations. Lissgarten et al.26 reported that FaST-LMM Select (a MLM approach) could control for population stratification when samples were from spatially structured populations. However, their approach reduced power substantially when causal rare variants are spatially clustered26,27.

In this article, we propose a PC based nonparametric regression (PC-nonp) approach to control for population stratification in rare variant association studies. PC-nonp adjusts population effects of both trait values and genotypes at candidate loci for PCs of genotypes at genomic markers by applying nonparametric regressions. We use extensive simulation studies to evaluate the performance of the proposed method PC-nonp and compare the performance of PC-nonp with that of GC and PC-linear developed for common variants and recently proposed biased urn permutation test (BiasePerm)28 developed for rare variants. Simulation results show that PC-nonp can control for population stratification well in all scenarios while GC, PC-linear, and BiasedPerm cannot control for population stratification at least in some scenarios. Results also show that PC-nonp’s robustness to population stratification will not reduce power. Furthermore, we evaluate the performance of our approach by applying it to the whole genome sequencing data from genetic analysis workshop 18 (GAW18) and find that only PC-nonp is effective to control for population stratification.

Method

Consider a sample of n unrelated individuals. Suppose that each individual has been genotyped at a candidate locus (single variant or multiple variants) and at L genomic markers. Let yi, xi, and pi denote the trait value, genotypic score at the candidate locus (weighted sum of genotypic scores if there are multiple variants), and the first k PCs (rescaled to the interval [0, 1]) of genotypes at genomic markers of the ith individual. The PCs of genotypes at genomic markers are good summary measures of ancestry or genetic background. PC-linear is probably the most popular method to control for population stratification. However, this method is based on linear combinations of PCs. Furthermore, recently developed BiasePerm28 is based on linear combinations of PCs on logistic scale if we use PCs as covariate vector28. The relationships between trait values and PCs can be highly nonlinear and population effects cannot be corrected by simply using linear functions23. Figure 1 shows the relationships between trait values and the first two PCs of genotypes at 10,000 genomic markers in two structured populations. This figure shows that the relationships between trait values and PCs are highly nonlinear and the forms of the relationships are different in different populations. When the relationships are highly nonlinear and the forms of relationships are unknown, we should use more flexible regression methods rather than use linear regression. Nonparametric regression is a very flexible regression method and it does not require the form of regression function.

Figure 1: The relationships between the first two PCs of genotypes at 10,000 genomic markers and trait values.
Figure 1

Genotypes at 10,000 genomic markers in the spatially structured populations are generated according to simulation set 2 in simulation section. Genotypes at 10,000 genomic markers in a population with 10 subpopulations are generated according to simulation set 1 in simulation section. The trait values in the spatially structured populations and in a population with 10 subpopulations are generated according to null distributions (without random error) in simulation set 2 and simulation set 1, respectively.

In this article, we propose a PC based nonparametric regression (PC-nonp) approach that adjusts population effects of both trait values and genotypes at candidate loci for PCs of genotypes at genomic markers by applying nonparametric regressions. That is,

where μ1(·) and μ2(·) are regression functions with unknown forms and will be estimated using smoothing techniques. Let and be the residuals of the nonparametric regressions. We can consider and as the trait value and genotypic score at the candidate locus of the ith individual after adjusting for population effects. We can construct association tests based on the residuals.

Many methods have been developed to estimate the unknown regression function, including local linear method29,30,31, kernel smoothing method32,33 and wavelet method34,35. We propose to use kernel smoothing method. Let K(·) be a kernel function with mode at 0. The kernel estimators of μ1(pi) and μ2(pi) are given by

respectively, where pj = (pj1, …, pjk) is the first k PCs for the jth individual, H = (h1, …, hk) is the smoothing parameter, and  = . If we denote , then and . With these nonparametric estimators, the fitted values of trait and the fitted values of genotypic scores at the candidate locus are given by and , respectively. Intuitively, and are the weighted mean of trait values and weighted mean of genotypic scores of those individuals whose genetic background is similar to that of the ith individual. Thus, we can consider residuals and as the trait value and genotypic score of the ith individual after adjusting for population stratification.

In this study, we use the quartic kernel33,

For computational consideration, we assume that h1 = ... = hk = h. Then,

To test association between trait values and genotypes based on and , we can use score test with test statistic Tscore = U2/V, where and . The statistic Tscore asymptotically follows a chi-square distribution with one degree of freedom (df)36. For rare variants, xi can be a weighted combination2 or collapsing1,3 of genotypes at multiple variants in a genomic region. Based on the residuals of the nonparametric regression, we can construct other rare variant association tests such as CMC1, SKAT9, and TOW8. We will discuss this issue in more details later in the discussion section. In this study, we use a single-variant test in which xi is the genotypic score of a single variant and a regional test in which xi is the weighted combination of genotypes at the variants in a genomic region2 to evaluate the performance of our proposed method.

We have so far assumed a given smoothing parameter in the kernel estimates. It is well known that choosing a proper value for smoothing parameter h is critical to kernel estimates of regression functions32,37. We use a method similar to that of Zhang et al.35 to choose smoothing parameter h. This method is based on the genotypes at a set of genomic markers. Suppose there are L genomic markers. We perform PC-nonp single-variant test for all the L genomic markers and denote P1, …, PL as the associated P-values. If population stratification is well controlled for, P-values P1, …, PL should follow a uniform distribution under the null hypothesis of no association. Let Fn be the empirical distribution function of the P-values P1, …, PL and F be the uniform distribution function. The Kolmogorov test statistic measures how close the distribution of the P-values P1, …, PL and the uniform distribution are. We propose to choose h* that minimizes the Kolmogorov test statistic, i.e.,

as the value of the smoothing parameter. h* can be obtained by a simple grid search across a range of h. We divide the interval [0, ∞)into subintervals 0 ≤ h1 < … < hS−1 < hS < ∞. Then, . The computational time to find h* increases linearly with S. However, h* needs to be calculated only once. We can use this h* to calculate the residuals of the nonparametric regression for trait values and genotypes at each variant. Let k denote the number of PCs used. In this study, we use hs = 22(s−23)/(5+k), where s = 1, …, 30 and k = 10. It is worth noting that the smoothing parameter h is chosen with the P-values of a single-variant test, whichever test is actually used in testing associations.

Software

R code for implementing our proposed method is given at Shuanglin Zhang’s homepage http://www.math.mtu.edu/~shuzhang/software.html. The R code includes three functions: PCA, choose_OPT_SMP, and Resid_Nonp. PCA gives the first k principal components of genotypes at genomic markers. choose_OPT_SMP chooses the optimal value of smoothing parameter. Given the value of the smoothing parameter, Resid_Nonp calculates the residuals of trait values and genotypes at a candidate region by applying nonparametric regression for PCs of genotypes at genomic markers.

Comparison of Tests

We compare the performance of the proposed test with that of the following four tests. (1) Uncorrected: this test is also based on the score test statistic . is the same as Tscore but is based on the original trait values yi and genotypic scores xi instead of based on the residuals. (2) GC17: GC divides by an inflation factor λ and , where is the value of when is applied to the lth genomic marker. (3) PC-linear20: this test is the same as PC-nonp but PC-linear is based on the residuals of linear regression instead of based on the residuals of nonparametric regression. (4) The biased urn permutation test (BiasedPerm)28: in this permutation procedure, the odds of a subject being selected as a case are equal to his or her odds of disease conditional on confounder variables. In this study, PC-linear, PC-nonp, and BiasedPerm are based on the first 10 PCs of genotypes at the genomic markers.

Simulations

We consider two sets of simulations: populations with k0 subpopulations and populations with spatially structured populations. In each set of simulations, we consider both qualitative and quantitative traits. To generate a qualitative disease affection status, we use a liability threshold model based on a continuous phenotype (quantitative trait). An individual is defined to be affected if the individual’s phenotype is at least one standard deviation larger than the phenotypic mean. This yields a prevalence of 16% for the simulated disease in the general population. In the following, we describe how to generate genotypes and how to generate a quantitative trait in the two sets of simulations.

Simulation Set 1: Populations with k 0 Subpopulations

This set of simulations is based on allele frequencies at 24,487 variants calculated from the empirical Mini-Exome genotype data provided by the genetic analysis workshop 17 (GAW17). The genotypes of GAW17 data set are extracted from the sequence alignment files provided by the 1000 Genomes Project for their pilot3 study (http://www.1000genomes.org). GAW17 data contain genotypes of 697 unrelated individuals at 24,487 variants. The distributions of MAF at rare variants (MAF < 0.01) and MAF at common variants of 24,487 variants are given in Figure S1.

To generate genotypes of individuals in a population with k0 subpopulations, we follow Price et al.20, Ionita-Laza et al.38, and Qin et al.39. For each variant, we randomly select a variant from 24,487 variants and take the MAF at this variant as the ancestral population allele frequency p. Then, independently draw k0 values from a beta-distribution with parameters p(1 − Fst)/Fst and (1 − p)(1 − Fst)/Fst, where Fst is the Wright’s measure of population subdivision40 (in this study, Fst = 0.01). For each variant, we accept as allele frequencies for the k0 subpopulations if ; we redraw otherwise. The MAF distributions at the rare variants (MAF < 0.01) and at the common variants for k0 = 5 are given in Figure S1.

To evaluate type I error, we generate trait values independent of genotypes by using the model:

where yij denotes the trait value of the jth individual in the ith subpopulation, εij follows a standard normal distribution, and μi is the population mean of the ith subpopulation. In this study, if k0 ≤ 2, we set μ1 = 0 and μ2 = μ; otherwise, we set and , where μ = 5 if k0 = 20; otherwise μ = 2.

To evaluate power, we consider nT variants (possibly both rare and common variants) in a genomic region. We randomly choose nc from the nT variants as causal variants (in this study, nc = nT/2). For the jth individual in the ith subpopulation, let xijl denote the genotypic score of the jth individual in the ith subpopulation at the lth causal variant. We assume that all the nc causal variants have the same heritability such that rarer variants have larger effects. Under this assumption, the disease model is given by

where βl are constants and their values depend on the total heritability.

Simulation Set 2: Spatially Structured Populations

We generate genotypes and phenotypes under spatially structured populations using the methods similar to those of Mathieson and McVean23. Briefly, the space is divided into K0 × K0 grid squares. Then, we generate genotypes by starting with a number of individuals and their locations on the grid. We work backward in time to generate random genealogical events. Each event is either a coalescence of two lineages or a migration of a single lineage from one square to another. The relative rates of coalescence and migration depend on the population-scaled migration rate M and the number and distribution of lineages on the grid (see Supplement materials or Mathieson and McVean23 for details).

To generate quantitative traits under null hypothesis, let ϕ: |1, n| → |1, K0| × |1, K0| be a function that maps each individual to the grid square from which they originated. Then, we generate the trait value of the ith individual by yi = βRϕ(i) + εi, where ϕ(i) = (l, j) if the ith individual originates from grid square l, j; Rl,j is the nongenetic risk in grid square l, j; εi is a standard normal random number; and β is a constant. We use the following three models to determine the value of Rl,j. Model 0: no population stratification in which Rl,j = 0 for all l and j. Model 1: a small and sharp spatial distribution in which Rl,j = 1 if l0 ≤ l ≤ l0 + 3 and j0 ≤ j ≤ j0 + 3 for l0 = j0 = 6, or 20 − l0 = j0 = 6, or l0 = j0 = 14; Rl,j = 0 otherwise. Model 2: a wide and smooth spatial distribution in which for l0 = j0 = 6. In this study, we use the following parameters: K0 = 20, M = 0.01, and β = 2.

Under alternative hypothesis, we assume that there are nT variants in a genomic region. We randomly choose nc from the nT variants as causal variants. For an individual, let xl denote the genotypic score at the lth causal variant. Under the assumption that all the nc causal variants have the same heritability, the trait value for an individual is generated by

where y0 is the trait value generated under null hypothesis.

Results

Existence of the minimum of Kolmogorov test statistic Kol(h)

We first perform simulation studies to evaluate the existence of the minimum of Kolmogorov test statistic Kol(h). We generate trait values and genotypes at 10,000 variants under simulation set 1 for k0 = 5 and k0 = 10 and under simulation set 2 models 1 and 2. Under each of the four scenarios, we calculate Kol(h) for different values of h. The relationships between Kol(h) and −log(h) under the four scenarios are given in Fig. 2. This figure shows that the curves of Kol(h) under the four scenarios are all bowl shaped and thus have minimum. The histograms of 10,000 P-values of the proposed test for different values of h are given in Figures S2–S5 for the four scenarios, respectively. From these figures, we can see that when h is large, population effects are not adjusted enough and thus the number of small P-values are more than expected; when h is small, population effects are over adjusted and thus the number of large P-values are more than expected; when h minimizes Kol(h), the distribution of P-values is very close to the uniform distribution.

Figure 2: The relationships between −log(smoothing parameter h) and Kolmogorov test statistic Kol(h) in four structured populations.
Figure 2

Evaluate type I error rates

We use 10,000 replicated samples to evaluate type I error rates. For BiasedPerm, we use 5,000 permutations to evaluate P-values. For all other tests, we use asymptotic distributions to evaluate P-values. For 10,000 replicated samples, the 95% confidence intervals (CIs) for type I error rates of nominal levels 0.01 and 0.001 are (0.008, 0.012) and (0.00037, 0.00163), respectively.

To evaluate type I error rates, we first want to see the performance of the asymptotic distributions we used. For this purpose, we perform simulations under null hypothesis in a homogenous population (k0 = 1 in simulation set 1) and in the case of no population stratification (model 0 in simulation set 2). Type I error rates are given in Tables 1 and 2 for quantitative traits and qualitative traits, respectively. Table 1 shows that, for quantitative traits, type I error rates of all the four tests in all the scenarios are within the corresponding 95% confidence intervals, which indicates that the asymptotic distributions work very well. Table 2 shows that, for qualitative traits, most of the type I error rates are within the corresponding 95% CIs and those of the type I error rates that are not in the 95% CIs are very close to the corresponding 95% CIs, which indicates that the asymptotic distributions approximately work well.

Table 1: Type I error rates of four tests in a homogenous population and in the case of no population stratification for quantitative traits.
Table 2: Type I error rates of four tests in a homogenous population and in the case of no population stratification for qualitative traits.

Type I error rates under structured populations in simulation set 1 for k0 = 2, 10, 20 are given in Tables 3 and 4 for quantitative traits and qualitative traits, respectively. As shown by these two tables, Uncorrected has inflated type I error rates in all the scenarios. GC cannot control for population stratification for quantitative traits when k0 = 10 and 20 because most variants have very small correlation with the trait. PC-linear and BiasedPerm cannot control for population stratification when k0 = 20 because the linear combinations of the first 10 PCs cannot discriminate 20 subpopulations. Only PC-nonp can control for population stratification in all simulation scenarios. If we increase the number of PCs, PC-linear and BiasedPerm may control for population stratification when k0 = 20. The problems to use PC-linear and BiasedPerm to control for population stratification are (1) we do not know how many PCs should be used and (2) increasing the number of PCs may decrease the power.

Table 3: Type I error rates of four tests based on simulation set 1 for quantitative traits.
Table 4: Type I error rates of five tests based on simulation set 1 for qualitative traits.

Type I error rates under spatially structured populations in simulation set 2 for models 1 and 2 are given in Tables 5 and 6 for quantitative traits and qualitative traits, respectively. These two tables show that Uncorrected has inflated type I error rates in all the scenarios. GC cannot control for population stratification for single variant test because most variants have very small correlation with the trait. PC-linear and BiasedPerm have inflated type I error rates under model 1 because these two methods try to correct highly nonlinear relationships on the basis of linear functions of relatedness. PC-nonp can control for population stratification well in all simulation scenarios because nonparametric regressions can adapt any function, linear or nonlinear.

Table 5: Type I error rates of four tests based on simulation set 2 for quantitative traits.
Table 6: Type I error rates of five tests based on simulation set 2 for qualitative traits.

Power comparison

To evaluate if PC-nonp’s robustness to population stratification will reduce power, we perform simulation studies to compare power using regional tests under k0 = 1 and k0 = 10 in simulation set 1 and under models 0 and 2 in simulation set 2, in which all tests except Uncorrected can control for population stratification well. Power comparisons under k0 = 1 and k0 = 10 in simulation set 1 are given in Fig. 3. This figure shows that, when there is no population stratification (a homogenous population), all tests have very similar powers. When there is population stratification (a structured population with 10 subpopulations), PC-nonp and PC-linear are more powerful than Uncorrected and BiasedPerm, and GC has the lowest power. GC loses power because it has a larger inflation factor when there is population stratification. BiasedPerm essentially performs permutation within subpopulations and thus it will lose power when there are a large number of subpopulations. Uncorrected loses power because, in the structured population with 10 subpopulations, different trait value means in subpopulations weaken the association signal. PC-nonp and PC-linear do not lose power because, after adjusted for population effects, it appears that PC-nonp and PC-linear perform association tests in a homogenous population.

Figure 3: Power comparisons based on populations with k0 subpopulations.
Figure 3

“Homo” means that simulations are based on a homogenous population (k0 = 1 in simulation set 1). “Structured” means that simulations are based on a structured population with 10 subpopulations.

Power comparisons under models 0 and 2 in simulation set 2 are given in Fig. 4. As shown by this figure, for quantitative traits, the pattern of power comparisons is very similar to that in Fig. 3. For qualitative traits, Uncorrected is the most powerful one. The pattern of power comparisons among PC-nonp, PC-linear, BiasedPerm, and GC is very similar to that in Fig. 3.

Figure 4: Power comparisons based on spatially structured populations.
Figure 4

“No stratification” means that trait values have no relation with spatial position (model 0 in simulation set 1). “Structured” means that trait values are generated according to spatially structured model 2.

Analysis of GAW18 whole genome sequencing data set

The data set for GAW18 includes whole genome sequencing (WGS) data of 959 individuals (464 directly sequenced and the rest imputed) from 20 Mexican American pedigrees from San Antonio, Texas. There are 21–76 individuals in each pedigree. Phenotype data include sex, age, year of examination, systolic and diastolic blood pressure (SBP and DBP), use of antihypertensive medications, and tobacco smoking at up to four time points.

Since Mexican American population is admixture population, association studies based on unrelated individuals from this population may be subjected to bias due to population stratification. For our purpose, we extract 132 genetically unrelated individuals from the 20 pedigrees with phenotypes and WGS data and select SBP as the trait of interest while take sex, age, use of antihypertensive medications, and tobacco smoking as covariates. For WGS data, we only consider one chromosome (chromosome 17). Among the 132 unrelated individuals, there are 404,032 SNPs on chromosome 17. Since the sample size is small, we only consider the 41,754 uncommon SNPs with MAF between 0.02 and 0.05 instead of including rare SNPs. We randomly draw 10,000 SNPs from the 41,754 SNPs without replacement and test association between the phenotype and each of the 10,000 SNPs using each of the four tests: Uncorrected, GC, PC-linear, and PC-nonp. We repeat the drawing procedure 4 times with re-drawing 10,000 SNPs from the 41,754 SNPs. Quantile-quantile plots of the observed −log10(P-values) of the four tests and expected log10(P-value) under the assumption of uniform distribution of P-values are given in Fig. 5. All quantile-quantile plots are averaged over 4 draws in order to show the average effect. Since we randomly draw 10,000 SNPs across chromosome 17, it is unlikely that there are a large number of SNPs in the 10,000 SNPs associated with SBP. Therefore, if population stratification can be well controlled for, P-values should proximately follow a uniform distribution. Figure 5 shows that only P-values of PC-nonp nearly follow a uniform distribution while for all other tests, the number of small P-values is more than expected.

Figure 5: Quantile-quantile plots of observed −log10(P-values) of four tests and expected −log10(P-value).
Figure 5

All quantile-quantile plots are averaged over 4 times replicated draws in order to show the average effect. Each draw, we randomly choose 10,000 SNPs without overlap from the 41,754 uncommon SNPs with MAF between 0.02 and 0.05 on chromosome 17.

Discussion

With the development of next-generation sequencing technology, there is increasing interest to detect associations between rare variants and complex traits. Many statistical methods have been developed for detecting rare variant associations. However, these methods may be subject to bias due to population stratification and, as pointed out by Mathieson and McVean23, existing methods developed to control for stratification are not necessarily effective in rare variant associations. Therefore, statistical methods that can control for population stratification in rare variant association studies are needed. In this article, we propose the PC-nonp approach to control for population stratification in rare variant association studies. To apply PC-nonp, we first calculate PCs of genotypes at the genomic markers. Then, we use these PCs to adjust population effects of both trait values and genotypes at a candidate locus by applying nonparametric regressions. Our simulations show that the proposed PC-nonp can control for population stratification well in all scenarios while existing methods cannot control for population stratification at least in some scenarios. Simulations also show that PC-nonp’s robustness to population stratification will not reduce power. Applications to the GAW18 whole genome sequencing data set also show that our proposed method can control for population stratification better than existing methods.

Although we describe our proposed method using a single-variant test and a weighted sum regional test, our method can be applied to most existing rare variant association tests such as CMC1, SKAT9, and TOW8. To apply our method to SKAT and TOW, denote yi and xim as the trait value and genotypic score at the mth variant of the ith individual. Let and denote the residuals of nonparametric regressions yi = μ(pi) + εi and xim = μm(pi) + εim, where i = 1, …, n and m = 1, …, M. Based on the residuals and , the test statistics of both SKAT and TOW can be written as , where . In TOW, while, in SKAT, , the beta distribution density function with pre-specified parameters a1 and a2 evaluated at the sample MAF for the mth variant in the data. To apply our method to CMC, suppose that M variants can be classified as Sr groups of rare variants and Sc individual variant sites. Define indicator variables xis(i = 1, …, n; s = 1, …, Sr) for all individuals and the Sr groups of rare variants, where xis = 1 if minor alleles at any variant in the sth group of the ith individual are present; xis = 0 otherwise. Let S = Sr + Sc and define (s = 1, …, Sc) as the genotypic score of the ith individual at the sth individual variant site. Let and denote the residuals of nonparametric regressions yi = μ(pi) + εi and xis = μs(pi) + εis, where i = 1, …, n and s = 1, …, S. Based on residuals and , we cannot use T2 test because are not 0 and 1. We can use a score test or the improved score test36.

Zhang et al.41 proposed a semi-parametric test for association (SPTA) to control for population stratification. SPTA models the relationship between trait values, genotypic scores at the candidate marker, and PCs of genotypes at genomic markers through a semi-parametric model, where the exact form of relationship between trait values and PCs is assumed unknown, but trait values have linear relationship with genotypic scores at the candidate marker. Although SPTA and PC-nonp are equivalent for single-variant tests under quantitative traits, SPTA is difficult to extend to regional rare variant association tests such as SKAT and TOW because it is designed for single-variant tests.

Additional Information

How to cite this article: Sha, Q. et al. A Nonparametric Regression Approach to Control for Population Stratification in Rare Variant Association Studies. Sci. Rep. 6, 37444; doi: 10.1038/srep37444 (2016).

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  1. 1.

    & Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet 83, 311–21 (2008).

  2. 2.

    & A Groupwise Association Test for Rare Mutations Using a Weighted Sum Statistic. Plos Genetics 5 (2009).

  3. 3.

    & A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: a cohort allelic sums test (CAST). Mutat Res 615, 28–56 (2007).

  4. 4.

    et al. Pooled association tests for rare variants in exon-resequencing studies. Am J Hum Genet 86, 832–8 (2010).

  5. 5.

    et al. Extending rare-variant testing strategies: analysis of noncoding sequence and imputed genotypes. Am J Hum Genet 87, 604–17 (2010).

  6. 6.

    et al. Optimal unified approach for rare-variant association testing with application to small-sample case-control whole-exome sequencing studies. Am J Hum Genet 91, 224–37 (2012).

  7. 7.

    et al. Testing for an unusual distribution of rare variants. PLoS Genet 7, e1001322 (2011).

  8. 8.

    , , & Detecting association of rare and common variants by testing an optimally weighted combination of variants. Genet Epidemiol 36, 561–71 (2012).

  9. 9.

    et al. Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet 89, 82–93 (2011).

  10. 10.

    & A data-adaptive sum test for disease association with multiple common or rare variants. Hum Hered 70, 42–54 (2010).

  11. 11.

    , & Comprehensive approach to analyzing rare genetic variants. PLoS One 5, e13584 (2010).

  12. 12.

    & A general framework for detecting disease associations with rare variants in sequencing studies. Am J Hum Genet 89, 354–67 (2011).

  13. 13.

    & Bayesian analysis of rare variants in genetic association studies. Genet Epidemiol 35, 57–69 (2011).

  14. 14.

    , & Robust and powerful tests for rare variants using Fisher’s method to combine evidence of association from two or more complementary tests. Genet Epidemiol 37, 110–21 (2013).

  15. 15.

    , , & Gm3;5,13,14 and type 2 diabetes mellitus: an association in American Indians with genetic admixture. Am J Hum Genet 43, 520–6 (1988).

  16. 16.

    & Genetic dissection of complex traits. Science 265, 2037–48 (1994).

  17. 17.

    & Genomic control for association studies. Biometrics 55, 997–1004 (1999).

  18. 18.

    , & Genomic control, a new approach to genetic-based association studies. Theor Popul Biol 60, 155–166 (2001).

  19. 19.

    & Detecting association in a case-control study while correcting for population stratification. Genetic Epidemiology 20, 4–16 (2001).

  20. 20.

    et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38, 904–9 (2006).

  21. 21.

    et al. Variance component model to account for sample structure in genome-wide association studies. Nat Genet 42, 348–54 (2010).

  22. 22.

    et al. Mixed linear model approach adapted for genome-wide association studies. Nat Genet 42, 355–60 (2010).

  23. 23.

    & Differential confounding of rare and common variants in spatially structured populations. Nat Genet 44, 243–6 (2012).

  24. 24.

    , & Adjustment for population stratification via principal components in association analysis of rare variants. Genet Epidemiol 37, 99–109 (2013).

  25. 25.

    , & Assessing the impact of population stratification on association studies of rare variation. Hum Hered 76, 28–35 (2013).

  26. 26.

    , & FaST-LMM-Select for addressing confounding from spatial structure and rare variants. Nat Genet 45, 470–1 (2013).

  27. 27.

    & Reply to: “FaST-LMM-Select for addressing confounding from spatial structure and rare variants”. Nat Genet 45, 471 (2013).

  28. 28.

    et al. A permutation procedure to correct for confounders in case-control studies, including tests of rare variation. Am J Hum Genet 91, 215–23 (2012).

  29. 29.

    Local linear regression smoothers and their minimax efficiencies. The Annals of Statistics, 196–216 (1993).

  30. 30.

    & Local linear estimation in partly linear models. Journal of Multivariate Analysis 60, 1–19 (1997).

  31. 31.

    & Cross-validated local linear nonparametric regression. Statistica Sinica, 485–512 (2004).

  32. 32.

    Smoothing methods in statistics, (Springer Science & Business Media, 2012).

  33. 33.

    Kernel smoothing in partial linear models. Journal of the Royal Statistical Society. Series B (Methodological), 413–436 (1988).

  34. 34.

    & Adapting to unknown smoothness via wavelet shrinkage. Journal of the american statistical association 90, 1200–1224 (1995).

  35. 35.

    & Wavelet threshold estimation for additive regression models. Annals of Statistics, 152–173 (2003).

  36. 36.

    , & An improved score test for genetic association studies. Genet Epidemiol 35, 350–9 (2011).

  37. 37.

    Nonparametric smoothing and lack-of-fit tests, (Springer Science & Business Media, 2013).

  38. 38.

    , , & Genomewide weighted hypothesis testing in family-based association studies, with an application to a 100 K scan. Am J Hum Genet 81, 607–4 (2007).

  39. 39.

    , , & A data-driven weighting scheme for family-based genome-wide association studies. Eur J Hum Genet 18, 596–603 (2010).

  40. 40.

    & A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identity and paternity. Genetica 96, 3–12 (1995).

  41. 41.

    , & On a semiparametric test to detect associations between quantitative traits and candidate genes using unrelated individuals. Genet Epidemiol 24, 44–56 (2003).

Download references

Acknowledgements

Research reported in this publication was supported by the National Human Genome Research Institute of the National Institutes of Health under Award Numbers R15HG008209 (Sha Q and Zhang S) and R01 HG008115 (Sha Q and Zhang K). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The Genetic Analysis Workshops are supported by NIH grant R01 GM031575 from the National Institute of General Medical Sciences. Preparation of the Genetic Analysis Workshop 17 and 18 Simulated Exome Data Set was supported in part by NIH R01 MH059490 and used sequencing data from the 1000 Genomes Project (www.1000genomes.org).

Author information

Affiliations

  1. Department of Mathematical Sciences, Michigan Technological University, Houghton, MI 49931, USA

    • Qiuying Sha
    • , Kui Zhang
    •  & Shuanglin Zhang

Authors

  1. Search for Qiuying Sha in:

  2. Search for Kui Zhang in:

  3. Search for Shuanglin Zhang in:

Contributions

Q.S. and S.Z. designed research, S.Z. performed statistical analysis, and Q.S., K.Z., and S.Z. wrote the manuscript.

Competing interests

The authors declare no competing financial interests.

Corresponding author

Correspondence to Shuanglin Zhang.

Supplementary information

Word documents

  1. 1.

    Supplemental Materials

About this article

Publication history

Received

Accepted

Published

DOI

https://doi.org/10.1038/srep37444

Further reading

Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.