Introduction

The current wave of genome-wide association studies (GWAS), which focuses on relatively common single-nucleotide polymorphisms (SNPs) (minor allele frequency (MAF) > 5%), has successfully identified hundreds of loci associated with the risk of various diseases. To date, more than 5900 SNPs have been reported to be associated with different diseases (http://www.genome.gov/gwastudies/). However, some studies have suggested that the genetic variants underlying common diseases could span a wide spectrum of frequencies, ranging from rare to common, and that rare variants could exhibit relatively large genetic effects (for example, an odds ratio greater than 2).1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 For example, in 2008, Stefansson et al.10 found that three rare deletions were associated with schizophrenia, with odds ratios of 2.7, 11.5 and 14.8, respectively. Some authors have proposed novel methods to detect associations between multiple rare variants and common diseases.16, 17, 18, 19, 20

Current GWAS of common SNPs usually adopt a cost-effective staged design. In the first stage, a proportion of the available sample is genotyped using a commercially available SNP array with reasonable coverage of the whole genome; in the second stage, the promising SNPs with P-values below a given threshold are genotyped in the remaining samples.21, 22, 23, 24 For data analysis, the joint analysis strategy, which combines the statistics from the two stages for the final evaluation of the association evidence, is generally more powerful than the replication-based analysis, which uses only the statistic from the second stage.24, 25, 26

In principle, this staged design can be used for future GWAS or candidate gene association studies focusing on rare genetic variants. However, the analysis of rare variants under such a multi-staged design differs from that of common variants, mainly because large-sample theory is no longer valid. It also depends on the type of statistic used for the association test. Here, we focus on the beta test proposed by Li et al.27 and show how to use it for joint analysis under a two-stage design.

Materials and methods

Beta test

By considering the probability that a rare event occurs in Population 1 conditional on its occurrence in Population 2, Li et al.27 derived the uniform test and the beta test, and recommended the beta test for evaluating the association of a single rare variant. Their beta test can be summarized as follows. Assume that a rare event occurs x times among n1 independent trials in Population 1 and y times among n2 independent trials in Population 2. The null hypothesis is that the probability of the rare event occurring in Population 1 is equal to that in Population 2. For a common case–control design, the cases are taken as Population 1 and the controls as Population 2. They then calculated the conditional probability

where B(·) is a beta function with . The two-sided P-value for beta test is given by

where I(·) is an indicator function.

The proposed procedure

Assume that r cases and s controls are randomly drawn from the source population in a two-stage case–control genetic association study, and that a proportion π of the subjects is genotyped in Stage I. We assume that cases and controls are sampled from a homogeneous general population in which Hardy–Weinberg equilibrium holds in the control population and no stratification exists. The genotype counts for a given biallelic marker in both stages are given in Table 1, where the two alleles are denoted by A and a, and A is the rare or high-risk allele. The count of genotype AA is collapsed with that of Aa because individuals with genotype AA are extremely rare among both cases and controls. Let γ be the significance level used to select SNPs for follow-up and α be the significance level for each SNP at the genome-wide level (γ > α). Let H0 be the null hypothesis that a SNP is not associated with the disease, and let H1 be the corresponding alternative hypothesis (that is, the negation of H0). Based on the data in Table 1, the conditional probabilities are calculated,

Table 1 Genotype counts for both stages (A is the rare and high-risk allele)

Then P-values of a SNP in both stages are, respectively, equal to

We first consider the replication-based analysis. Let γ denote the P-value threshold for selecting SNPs from Stage I for follow-up, and b the threshold for the beta test in Stage II. Controlling the false-positive rate at a chosen level α, we obtain b from the equation

$$\Pr\nolimits_{H_0}\left(P_1 < \gamma,\ P_2 < b\right) = \alpha.$$

According to the Lemma in Section A.1 of the Appendix (whose proof uses Lemma 1 of Li et al.28), P1 ~ U(0,1) and P2 ~ U(0,1) as min(r,s) → ∞ with r/s → φ < ∞. Because the two stages use independent samples, this gives γb = α, so b = α/γ. The power is βR = Pr_{H1}(P1 < γ, P2 < b). Under a specific alternative hypothesis, we propose the following procedure to calculate the power.

  1) Generate B data sets (for example, B = 1 × 10^4) under the alternative hypothesis.

  2) For each data set, use the beta test to calculate P1 and P2; denote them by P1(1), …, P1(B) and P2(1), …, P2(B), respectively.

  3) βR can be approximated by

$$\hat{\beta}_R = \frac{1}{B}\sum_{i=1}^{B} I\left(P_1^{(i)} < \gamma,\ P_2^{(i)} < \alpha/\gamma\right).$$
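The three steps above amount to counting how often both stage thresholds are met. The following is a minimal Python sketch, assuming the per-data-set P-values have already been computed by the beta test, which is not implemented here. The function name `replication_power` and the toy P-values are hypothetical and serve only as illustration.

```python
def replication_power(p1, p2, gamma, alpha):
    """Monte Carlo estimate of the replication-based power.

    p1, p2 : Stage I and Stage II P-values, one pair per simulated data set.
    gamma  : Stage I selection threshold.
    alpha  : per-SNP type I error level; the Stage II threshold is alpha / gamma.
    """
    if len(p1) != len(p2) or not p1:
        raise ValueError("p1 and p2 must be non-empty and of equal length")
    b = alpha / gamma  # Stage II threshold
    hits = sum(1 for a, c in zip(p1, p2) if a < gamma and c < b)
    return hits / len(p1)


# Toy input (not simulated from any genetic model): four data sets.
p1 = [0.001, 0.02, 0.005, 0.0004]
p2 = [1e-6, 1e-7, 1e-4, 2e-6]
power = replication_power(p1, p2, gamma=0.01, alpha=5e-8)
print(power)
```

In this toy input, the second data set fails the Stage I threshold and the third fails the Stage II threshold, so two of the four data sets count towards the estimate.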

A joint testing statistic can be defined as

Given the type I error level α and the Stage I P-value threshold γ, the threshold c for the final joint analysis is chosen to satisfy the condition

$$\Pr\nolimits_{H_0}\left(P_1 < \gamma,\ P_J < c\right) = \alpha.$$

In Section A.2 of the Appendix, we show that c can be solved through the following equation,

Under a specific alternative hypothesis, the power for the joint analysis is given by

$$\beta_J = \Pr\nolimits_{H_1}\left(P_1 < \gamma,\ P_J < c\right).$$

We propose to use the following procedure to calculate βJ.

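  1) Generate B data sets (for example, B = 1 × 10^4) under the alternative hypothesis.

  2) For each data set, use the beta test to calculate P1 and P2; denote them by P1(1), …, P1(B) and P2(1), …, P2(B), respectively. Then calculate PJ(1), …, PJ(B), where each PJ(i) is the joint statistic computed from P1(i) and P2(i).

  3) βJ can be approximated by

$$\hat{\beta}_J = \frac{1}{B}\sum_{i=1}^{B} I\left(P_1^{(i)} < \gamma,\ P_J^{(i)} < c\right).$$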
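The calibration of c and the Monte Carlo estimate of βJ can be sketched in Python as follows. The joint statistic used here, the product PJ = P1 × P2, is only an illustrative stand-in and is not claimed to be the statistic defined in this paper. The closed form in `null_tail_prob` holds for that stand-in when P1 and P2 are independent U(0,1) under H0. All function names are hypothetical.

```python
import math


def null_tail_prob(c, gamma):
    """Pr(P1 < gamma, P1 * P2 < c) for independent U(0,1) P1 and P2.

    Valid for 0 < c <= gamma. Integrating over P1 gives
    c + c * ln(gamma / c).
    """
    return c * (1.0 + math.log(gamma / c))


def solve_joint_threshold(alpha, gamma, n_iter=200):
    """Bisection for the c in (0, gamma) with null_tail_prob(c, gamma) = alpha."""
    if not 0.0 < alpha < gamma <= 1.0:
        raise ValueError("need 0 < alpha < gamma <= 1")
    lo, hi = 0.0, gamma  # the tail probability increases from 0 to gamma
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if null_tail_prob(mid, gamma) < alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def joint_power(p1, p2, gamma, c):
    """Fraction of simulated data sets with P1 < gamma and P1 * P2 < c."""
    if len(p1) != len(p2) or not p1:
        raise ValueError("p1 and p2 must be non-empty and of equal length")
    hits = sum(1 for a, b in zip(p1, p2) if a < gamma and a * b < c)
    return hits / len(p1)


# Threshold for the stand-in statistic at alpha = 0.0001 and gamma = 0.01.
c = solve_joint_threshold(alpha=1e-4, gamma=0.01)
print(c, null_tail_prob(c, 0.01))
```

For a different joint statistic, only `null_tail_prob` and the combination rule inside `joint_power` would need to change; the bisection step relies solely on the null probability being increasing in c.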

Results

In this section, we conduct simulation studies to evaluate the performance of the proposed procedures. We mainly compare the type I error rates and power of the replication-based analysis and the joint analysis. We also apply both analyses to a real data set from two independent genetic association studies of rheumatoid arthritis (RA) to demonstrate the advantages of the proposed methods.

Type I error rate

We first check whether the proposed procedures maintain the correct type I error rates. Data are generated under the null hypothesis with disease prevalence K = 0.1 and r = s = 2000 cases and controls. In addition, γ = 0.01 is chosen to be comparable to other studies (for example, Schaid and Sinnwell, 2010).25 The risk allele is assumed to be the minor allele, and the MAF takes one of three values: 0.005, 0.007 and 0.01. Because of the computational cost, only the empirical type I error rate at α = 0.0001, with π of 0.2, 0.3, 0.4 and 0.5, is evaluated. One million replicates are used to calculate the empirical type I error rates. The results (Table 2) show that both the replication-based analysis and the joint analysis properly control the type I error rate, with empirical values close to the nominal level of 0.0001. For example, when MAF = 0.007 and π = 0.3, the type I error rates of the replication-based analysis and the joint analysis are 0.000093 and 0.000098, respectively.
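How close the empirical rates should be to the nominal level can be gauged from the Monte Carlo standard error. The sketch below assumes that the replicates are independent, so that the number of rejections is binomial; the function name `mc_standard_error` is hypothetical.

```python
import math


def mc_standard_error(alpha, n_replicates):
    """Standard error of an empirical rejection rate under a binomial model."""
    return math.sqrt(alpha * (1.0 - alpha) / n_replicates)


alpha = 1e-4             # nominal type I error level
n_replicates = 1_000_000  # replicates per simulation setting
se = mc_standard_error(alpha, n_replicates)

# Empirical rates quoted above for MAF = 0.007 and pi = 0.3.
for label, estimate in [("replication", 0.000093), ("joint", 0.000098)]:
    z = (estimate - alpha) / se
    print(f"{label}: estimate={estimate:.6f}, se={se:.2e}, z={z:+.2f}")
```

With one million replicates and α = 0.0001, the standard error is about 1 × 10^−5, so both quoted estimates lie within one standard error of the nominal level.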

Table 2 Empirical type I error rates (K=0.1, r = s = 2000, γ=0.01 and α=0.0001)

In a GWAS, a significance level of 5 × 10^−8 is often used. Because of time limitations, we only consider α = 0.0001 in the simulations. As shown in the Lemma, the proposed test can also maintain the type I error rate at a significance level of 5 × 10^−8.

Power

Assume that 1 000 000 SNPs are genotyped in a GWAS and that the genome-wide type I error rate is 0.05. Under the Bonferroni correction, the significance level for each SNP is then 0.05/1 000 000 = 5 × 10^−8. We set γ to 0.0001, 0.001 and 0.01, and π to 0.2, 0.3, 0.4 and 0.5. A dominant genetic model with a relative risk of 3 is used. Three minor allele frequencies are considered: 0.005, 0.007 and 0.01. The disease prevalence K is fixed at 0.1, and 5000 replicates are used to calculate the empirical power. The controls are sampled from the general population.
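The data-generating step of this simulation can be sketched in Python as follows. The sketch assumes Hardy–Weinberg equilibrium in the general population, carriers defined as genotype Aa or AA, and the same proportion π applied to cases and controls. Because the controls are drawn from the general population, the prevalence K cancels out of the carrier frequency in cases under the dominant model. The function names and the dictionary layout are hypothetical.

```python
import random


def carrier_freqs(maf, rel_risk):
    """Carrier (Aa or AA) probabilities in controls and in cases.

    Controls: general-population carrier frequency under HWE.
    Cases: dominant model, Pr(carrier | case) = RR*p / (1 - p + RR*p).
    """
    p_pop = 1.0 - (1.0 - maf) ** 2
    p_case = rel_risk * p_pop / (1.0 - p_pop + rel_risk * p_pop)
    return p_pop, p_case


def simulate_two_stage(r, s, pi, maf, rel_risk, rng):
    """Draw carrier counts for one two-stage data set.

    Returns (carriers, sample size) per group and stage.
    """
    p_ctrl, p_case = carrier_freqs(maf, rel_risk)
    r1, s1 = round(pi * r), round(pi * s)  # Stage I sample sizes

    def carriers(n, p):
        # Binomial(n, p) draw as a sum of Bernoulli trials.
        return sum(rng.random() < p for _ in range(n))

    return {
        "stage1": {"cases": (carriers(r1, p_case), r1),
                   "controls": (carriers(s1, p_ctrl), s1)},
        "stage2": {"cases": (carriers(r - r1, p_case), r - r1),
                   "controls": (carriers(s - s1, p_ctrl), s - s1)},
    }


per_snp_alpha = 0.05 / 1_000_000  # Bonferroni level for 1 000 000 SNPs
data = simulate_two_stage(2000, 2000, pi=0.3, maf=0.007, rel_risk=3.0,
                          rng=random.Random(1))
print(per_snp_alpha, data)
```

Setting `rel_risk=1.0` makes the case and control carrier frequencies equal, which gives data under the null hypothesis.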

The power results are summarized in Figure 1. As expected, the joint analysis is more powerful than the replication-based analysis. The power of both analyses increases with the proportion of subjects genotyped in Stage I. The difference in power between the two analyses increases as π or γ increases. For example, when MAF = 0.007 and γ = 0.01, the power difference between the two analyses is 0.07 for π = 0.3 and 0.32 for π = 0.5.

Figure 1

Power of a two-stage design for the replication-based analysis (replication) and the joint analysis (joint). The numbers of cases and controls are both equal to 2000. The genome-wide significance level per SNP is 5 × 10^−8 and π is the proportion of subjects genotyped in Stage I. The genotype relative risk for one or two copies of the risk allele is equal to 3 and the disease prevalence is 0.1. A full color version of this figure is available at the Journal of Human Genetics journal online.

We also evaluate the performance of the two analyses when the rare allele frequency differs between the samples from the two stages. We generate the data as before, but with an MAF of 0.005 in Stage I and 0.01 in Stage II. Figure 2 shows the power results. Again, the joint analysis is generally more powerful than the replication-based analysis. We also find that the two analyses perform similarly when π is relatively small (smaller than 0.2) or large (larger than 0.5).

Figure 2

Power of the replication-based analysis (replication) and the joint analysis (joint) when the MAFs in the two stages are unequal. The numbers of cases and controls are both equal to 2000. The genome-wide significance level per SNP is 5 × 10^−8 and π is the proportion of subjects genotyped in Stage I. The genotype relative risk for one or two copies of the risk allele is equal to 3 and the disease prevalence is 0.1. The MAFs in Stage I and Stage II are equal to 0.005 and 0.01, respectively. A full color version of this figure is available at the Journal of Human Genetics journal online.

A real data example: rheumatoid arthritis

Rheumatoid arthritis is a chronic autoimmune inflammatory disease that mainly attacks synovial joints. It affects about 1% of the general adult population worldwide.29 MacGregor et al.30 pointed out that genetic variants might have a major role in RA susceptibility. Several GWAS and candidate gene association studies in US and European populations have successfully identified multiple high-risk RA loci.31, 32, 33, 34 Here, we apply the proposed approaches to a rare SNP, rs5029937 (major allele G; minor allele T), which displayed a significant association with RA in two independent case–control genetic association studies.32, 34 We treat the genotype counts of rs5029937 from Orozco et al.32 (3962 RA cases and 3531 controls) as the Stage I data and those from Stahl et al.34 (5539 RA cases and 20 169 controls) as the Stage II data. The detailed genotype counts of rs5029937 in RA cases and controls, and the P-values under the two testing methods, are given in Table 3. They suggest that rs5029937 is strongly associated with RA (PReplication = 4.49 × 10^−9 and PJoint = 9.19 × 10^−9), whereas the P-values reported in Orozco et al.32 and in Stahl et al.34 are PTREND = 0.001 and PGWAS = 7.5 × 10^−8, respectively. The proposed testing strategies based on the beta test (Li et al.)27 thus appear to provide strong evidence when evaluating rare variants under a two-stage design.

Table 3 Genotype counts of rs5029937 and P-values for two proposed test methods

Discussion

Although GWAS focusing on relatively common SNPs have achieved great success in identifying loci related to disease susceptibility, mounting evidence suggests that rare genetic variants could have an important role in the genetic makeup of common human diseases and traits, and that some might have much larger genetic effects than those typically observed for relatively common variants. Recent studies suggest that rare genetic variants might contribute a considerable proportion of disease heritability.13, 14, 16 With recent advances in genomic technologies, in particular second-generation sequencing, it is becoming increasingly feasible to systematically screen both common and rare genetic variants throughout the human genome, and thus to explore the association of rare genetic variants with complex diseases at the genome-wide scale.

A cost-effective multi-staged design is often adopted in current GWAS focusing on common genetic variants. As in the study of common SNPs, we expect the multi-staged design to be widely used for the genome-wide study of rare genetic variants. However, this requires new statistical procedures, because the assumptions made in the original methods are not valid for rare variants. Here, we propose a new statistical procedure to evaluate the association between rare variants and disease outcomes under the multi-staged design. We conducted extensive simulation studies to validate the performance of the proposed method, and demonstrated that it is generally more efficient to combine data from all stages (the joint test) than to rely on the second-stage data only (the replication-based test), a conclusion similar to the well-known one for common SNPs in Skol et al. (2006).24

We also conducted simulation studies to compare the joint and replication-based analyses based on the beta test with those based on Fisher’s exact test. The results (not shown) indicate that the analyses based on the beta test are more powerful than those based on Fisher’s exact test.

We also evaluated the performance of the proposed methods when the odds ratios for the disease-associated SNPs differ between the two populations from which the Stage I and Stage II samples are collected. Our simulation results (not shown) indicate that the joint analysis is, in general, more powerful than the replication-based analysis in this situation. We therefore recommend the joint test as an alternative single-marker testing strategy for evaluating rare variants in future two-stage GWAS or candidate gene association studies.

We implemented the proposed procedures in R version 2.13.2. The R code is freely available from the first author on request.