Introduction

Genome-wide association studies, first proposed a decade ago by Risch and Merikangas (1996), are now being conducted to unravel the genetic etiology of complex human diseases (Klein et al. 2005; Thomas et al. 2005), enabled by rapidly decreasing genotyping costs, massively high-throughput genotyping technologies, and the large-scale SNP discovery and genotyping efforts of the SNP Consortium (Sachidanandam et al. 2001) and the International HapMap Consortium (Hinds et al. 2005). At present, the two-stage design is a more efficient method for genome-wide association studies than the one-stage design.

In a one-stage design, all available samples are genotyped on all markers. In a replication-based two-stage analysis, a dense set of SNP markers across the genome is genotyped and tested on a portion of the samples in stage 1, and the most promising markers are then genotyped and tested in the remaining samples in stage 2. In contrast, a joint analysis for two-stage genome-wide association studies examines the most promising markers identified in stage 1 using the samples from both stages. Joint analysis is therefore more efficient: its power is nearly the same as that of the one-stage design, while genotyping costs are substantially reduced (Satagopan et al. 2002, 2004; Satagopan and Elston 2003; Thomas et al. 2004; Skol et al. 2006). The crucial task for two-stage genome-wide association analysis is now to construct more powerful test statistics that make full use of the available data and to develop more efficient methods.

The power of two-stage genome-wide association studies to identify variants underlying a complex disease depends on a number of factors, including how the M markers are selected, the total sample size N, the proportion \(\pi_{{\rm samples}}\) of samples genotyped in stage 1, the proportion \(\pi_{{\rm markers}}\) of markers tested in stage 2, the strategy used to test for association, the underlying disease model, the effect size of the risk allele, and the disease prevalence (Skol et al. 2006). Thus, for a given set of design parameters, the relationships among \(\pi_{{\rm samples}}, \pi_{{\rm markers}}\), and N need to be determined in order to attain high power while controlling the genome-wide type I error rate.

Skol et al. (2006) showed that, for two-stage genome-wide association studies, joint analysis using a linear function of the risk allele frequencies in cases and controls is more powerful than replication-based analysis. (This joint analysis of Skol et al. will be referred to as the “linear joint analysis” throughout this article.) The gain arises because, in the second stage, linear joint analysis examines the data from both stages 1 and 2 rather than from stage 2 alone. The statistic of Skol et al. (2006), however, compares risk allele frequencies in cases and controls through a linear function of those frequencies. Shannon entropy provides a nonlinear function of the risk allele frequencies. Shannon entropy, originally defined in statistical physics and information theory (Cover and Thomas 1991; Greiner et al. 1995), measures the uncertainty removed, or the information gained, by performing an experiment. When applied to characterize DNA variation, entropy measures genetic diversity and extracts the maximal amount of information from a set of SNP markers (Hampe et al. 2003). The difference in the entropy of SNP markers between cases and controls is a measure of the association of the markers with disease (Zhao et al. 2005). In this paper we propose an entropy-based statistic with high power for joint analysis in two-stage genome-wide association studies.

The main purpose of this article is to develop a powerful entropy-based statistic, built on a nonlinear transformation of risk allele frequencies, for joint analysis in two-stage genome-wide association studies. We compare the power of one-stage analysis, linear joint analysis, and entropy-based nonlinear joint analysis by simulation. To demonstrate that amplifying the differences in allele frequencies through a nonlinear test statistic does not cause false-positive problems, we investigate the type I error rates of the entropy-based statistic in single-locus association tests by simulation. Finally, we compare the power of linear joint analysis with that of entropy-based joint analysis when the same false discovery rate is controlled. Based on these results, we recommend entropy-based joint analysis for genome-wide association studies.

Methods

We consider evaluating M markers using a case-control design, where the N cases and N controls are all unrelated individuals. Further, we assume that the markers are not in strong linkage disequilibrium with each other, so that they can be considered independent, and that the alleles at each locus are in Hardy–Weinberg equilibrium (HWE). We test every marker in a proportion \(\pi_{{\rm samples}}\) of the samples in stage 1 and select approximately \(M \times \pi_{{\rm markers}}\) markers for genotyping on the remaining N × (1 − πsamples) cases and controls in stage 2.

Denote by Z 1 and Z 2 the test statistics of a marker at stage 1 and stage 2, and by C 1 and C 2 the critical values for stage 1 and stage 2, respectively. We denote by P 0 (·) and P A (·) the probabilities of an event under the null and alternative hypotheses, respectively. Then, from Satagopan et al. (2002, 2003, 2004), the false-positive rate for a marker under a two-stage strategy is

$$\alpha_{{\rm markers}} = \alpha_{{\rm genome}} /M = P_{0} (Z_{1} > C_{1})P_{0} (Z_{2} > C_{2} |Z_{1} > C_{1}),$$

where αgenome is any desired genome-wide false-positive rate (type I error rate). The power of the two-stage strategy is the probability of selecting a disease locus under the alternative hypothesis, which is

$$\hbox{Power} = P_{A} (Z_{1} > C_{1}, Z_{2} > C_{2}) = P_{A} (Z_{1} > C_{1})P_{A} (Z_{2} > C_{2} |Z_{1} > C_{1}).$$

Denote by \(\hat{P}^{A}_{1}\) and \(\hat{P}^{U}_{1}\) the estimated risk allele frequencies in cases and controls in stage 1, respectively. The test statistic of Skol et al. (2006) is defined as

$$Z_{1} = \frac{\hat{P}_{1}^{A} - \hat{P}_{1}^{U}}{\sqrt {(\hat{P}_{1}^{A} (1 - \hat{P}_{1}^{A}) + \hat{P}_{1}^{U} (1 - \hat{P}_{1}^{U}))/(2N\pi_{{\rm samples}})}}.$$

Under the null hypothesis of no association, and when a large number of samples Nπsamples is genotyped in stage 1, Z 1 follows a normal distribution with mean 0 and variance 1. We can determine a threshold C 1 for selecting markers for follow-up such that \(P(|Z_{1} | > C_{1}) = \pi_{{\rm markers}}.\) Under the alternative hypothesis, the statistic Z 1 in large samples follows an approximate normal distribution with mean

$$\mu_{1} = \frac{P^{A} - P^{U}}{\sqrt {(P^{A} (1 - P^{A}) + P^{U} (1 - P^{U}))/(2N\pi_{{\rm samples}})}}$$

and variance 1, where P A and P U are the risk allele frequencies in cases and controls.
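As an illustration, the stage 1 statistic above can be computed directly from the estimated allele frequencies. The following is a minimal sketch in Python; the function name `linear_z` is ours, not from Skol et al. (2006):

```python
import math

def linear_z(p_hat_a, p_hat_u, n_stage1):
    """Allele-frequency difference statistic Z1 (linear form).

    p_hat_a, p_hat_u: estimated risk allele frequencies in cases and controls.
    n_stage1: number of cases (= number of controls) genotyped in stage 1,
              i.e. N * pi_samples in the paper's notation; 2 * n_stage1 is
              the number of alleles sampled per group.
    """
    var = (p_hat_a * (1 - p_hat_a) + p_hat_u * (1 - p_hat_u)) / (2 * n_stage1)
    return (p_hat_a - p_hat_u) / math.sqrt(var)
```

Under the null hypothesis (equal frequencies), the statistic is 0 by construction, and it grows with both the frequency difference and the stage 1 sample size.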

Entropy-based joint analysis

In statistical physics and information theory, entropy measures the uncertainty of random variables or the degree of non-structure within a system (Cover and Thomas 1991; Greiner et al. 1995). The entropy of a discrete variable or a system X is defined as:

$$S(X) = - {\sum\limits_i {p(x_{i})}}\log p(x_{i}),$$

where p(x i) = Prob(X = x i). Entropy can be used to measure DNA variations at disease genes underlying a complex disease (Ackerman et al. 2003; Hampe et al. 2003; Zhao et al. 2005).
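The definition of S(X) translates directly into code. A minimal sketch (natural logarithm, matching the log used throughout this paper; terms with zero probability contribute nothing and are skipped):

```python
import math

def shannon_entropy(probs):
    """Shannon entropy S(X) = -sum_i p(x_i) log p(x_i), natural log.

    probs: the probabilities p(x_i) of a discrete random variable.
    Zero-probability outcomes are skipped, since p log p -> 0 as p -> 0.
    """
    return -sum(p * math.log(p) for p in probs if p > 0)
```

For example, a fair binary variable has entropy log 2, the maximum for two outcomes, while a degenerate variable has entropy 0.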

The entropies of the risk allele at one marker in cases and controls are defined as S A = − P A log P A and S U = − P U log P U, respectively.

Then the new entropy-based test statistic for an association test is defined as:

$$Z^{e} = \frac{\hat{S}^{A} - \hat{S}^{U}}{\sqrt{ \frac{{{\hat{P}^{A}} (1 - \hat{P}^{A})(1 + \log \hat{P}^{A})^{2}}}{2N^{A}} + \frac{{{\hat{P}^{U}} (1 - \hat{P}^{U})(1 + \log \hat{P}^{U})^{2}}}{2N^{U}}}},$$

where \(\hat{S}^{A},\hat{S}^{U},\hat{P}^{A}, \hat{P}^{U}\) are the estimators of S A, S U, P A, and P U, respectively.

From Theorem 1.9 of Lehmann (1983), the statistic Z e is asymptotically distributed as a normal distribution with mean 0 and variance 1 under the null hypothesis of no association. Under the alternative hypothesis of association, Z e is asymptotically distributed as a normal distribution with mean

$$\mu^{e} = \frac{S^{A} - {S}^{U}}{\sqrt{\frac{{P^{A} (1 -{P}^{A})(1 + \log{P}^{A})^{2}}}{2N^{A}} + \frac{{P^{U} (1 -{P}^{U})(1 + \log{ P}^{U})^{2}}}{2N^{U}}}},$$

and variance 1.
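The statistic Z e can be evaluated as follows. This is a sketch under our own naming (`entropy_z` is not from the paper); the denominator is the delta-method standard error of \(\hat{S}^{A} - \hat{S}^{U}\), using d(−p log p)/dp = −(1 + log p):

```python
import math

def entropy_z(p_hat_a, p_hat_u, n_cases, n_controls):
    """Entropy-based statistic Z^e comparing S^A = -P^A log P^A with S^U.

    n_cases / n_controls: numbers of case / control individuals, so each
    group contributes 2 * n alleles (hence the 2N terms in the variance).
    """
    s_a = -p_hat_a * math.log(p_hat_a)
    s_u = -p_hat_u * math.log(p_hat_u)
    var = (p_hat_a * (1 - p_hat_a) * (1 + math.log(p_hat_a)) ** 2 / (2 * n_cases)
           + p_hat_u * (1 - p_hat_u) * (1 + math.log(p_hat_u)) ** 2 / (2 * n_controls))
    return (s_a - s_u) / math.sqrt(var)
```

As with the linear statistic, equal estimated frequencies give Z e = 0, and larger frequency differences at rare alleles are amplified by the nonlinear entropy transform.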

In stage 1, when the numbers of cases and controls genotyped are equal, \(N^{A} = N^{U} = N\pi_{{\rm samples}}\), the statistic for entropy-based joint analysis is

$$Z^{e}_{1} = \frac{\sqrt {2N\pi_{{\rm samples}}}(\hat{S}^{A} - \hat{S}^{U})}{\sqrt {\hat{P}^{A} (1 - \hat{P}^{A})(1 + \log \hat{P}^{A})^{2} + \hat{P}^{U} (1 - \hat{P}^{U})(1 + \log \hat{P}^{U})^{2}}}.$$

Under the null hypothesis of no association, and when a large number of samples Nπsamples is genotyped in stage 1, Z e1 follows a normal distribution with mean 0 and variance 1. The threshold \(C_{1}^{e} = \Phi^{{- 1}} (1 - \pi_{{\rm markers}} /2)\) is determined for selecting markers for follow-up genotyping. So the probability that a marker will be selected for stage 2 genotyping is

$$P_{1}^{e} = 1 - \Phi (C_{1}^{e} - \mu_{1}^{e}) + \Phi (- C_{1}^{e} - \mu_{1}^{e}),$$

where \(\mu^{e}_{1} = \frac{{{\sqrt {2N\pi_{{samples}}}}(S^{A} - S^{U})}}{{{\sqrt {P^{A} (1 - P^{A})(1 + \log P^{A})^{2} + P^{U} (1 - P^{U})(1 + \log P^{U})^{2}}}}}.\) Similarly, an analogous statistic Z e2 is calculated using genotype data only from stage 2.
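The stage 1 threshold and the selection probability above can be computed with the standard normal quantile and distribution functions. A sketch using the Python standard library (function names are ours):

```python
from statistics import NormalDist

_std = NormalDist()  # standard normal N(0, 1)

def stage1_threshold(pi_markers):
    """Two-sided threshold C1 such that P(|Z1| > C1) = pi_markers under H0,
    i.e. C1 = Phi^{-1}(1 - pi_markers / 2)."""
    return _std.inv_cdf(1 - pi_markers / 2)

def selection_prob(c1, mu1):
    """P(|Z1| > C1) when Z1 ~ N(mu1, 1):
    1 - Phi(C1 - mu1) + Phi(-C1 - mu1)."""
    return 1 - _std.cdf(c1 - mu1) + _std.cdf(-c1 - mu1)
```

With mu1 = 0 this recovers pi_markers exactly (the null selection rate), while a nonzero noncentrality mu1 raises the probability that a disease marker survives to stage 2.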

In stage 2, the entropy-based statistic \(Z^{e}_{{\rm joint}}\) is calculated using genotype data from both stages 1 and 2 as

$$Z_{{\rm joint}}^{e} = {\sqrt {\pi_{{\rm samples}}}}Z_{1}^{e} + {\sqrt {1 - \pi_{{\rm samples}}}}Z_{2}^{e}.$$

Conditional on the observed stage 1 statistic Z e1 = x, the statistic for joint analysis \(Z^{e}_{{\rm joint}}\) follows an approximate normal distribution in large samples with mean

$$\mu_{{\rm joint}}^{e} = \frac{{\sqrt {2N}}(S^{A} - S^{U})}{\sqrt {P^{A} (1 - P^{A})(1 + \log P^{A})^{2} + P^{U} (1 - P^{U})(1 + \log P^{U})^{2}}} + {\sqrt {\pi_{{\rm samples}}}}(x - \mu^{e}_{1})$$

and variance 1 − πsamples. Under the null hypothesis of no association, \(\mu^{e}_{{\rm joint}} = {\sqrt {\pi_{{\rm samples}}}}x.\) The critical value \(C^{e}_{{\rm joint}}\) can be calculated iteratively by finding the threshold that satisfies \(P_{0} ({\left| {Z^{e}_{{\rm joint}}} \right|} > C^{e}_{{\rm joint}} \mid {\left| {Z^{e}_{1}} \right|} > C^{e}_{1}) = \alpha_{{\rm genome}} / (M\pi_{{\rm markers}}).\) The probability of detecting association in stage 2 in an entropy-based joint analysis is

$$\begin{aligned} P_{{\rm joint}}^{e} &= P_{A} ({\left| {Z_{{\rm joint}}^{e}} \right|} > C_{{\rm joint}}^{e} \mid {\left| {Z_{1}^{e}} \right|} > C_{1}^{e}) \\ &= \int_{C_{1}^{e}}^{\infty} P_{A} ({\left| {Z_{{\rm joint}}^{e}} \right|} > C_{{\rm joint}}^{e} \mid Z_{1}^{e} = x)\, f(x \mid {\left| {Z_{1}^{e}} \right|} > C_{1}^{e})\,dx \\ &\quad + \int_{-\infty}^{-C_{1}^{e}} P_{A} ({\left| {Z_{{\rm joint}}^{e}} \right|} > C_{{\rm joint}}^{e} \mid Z_{1}^{e} = x)\, f(x \mid {\left| {Z_{1}^{e}} \right|} > C_{1}^{e})\,dx \\ &= \int_{|x| > C_{1}^{e}} \int_{{\left| y \right|} > C_{{\rm joint}}^{e}} \frac{1}{2\pi {\sqrt {1 - \pi_{{\rm samples}}}}\,[1 - \Phi (C_{1}^{e} - \mu_{1}^{e}) + \Phi (- C_{1}^{e} - \mu^{e}_{1})]} \\ &\qquad \times \exp \left(- \frac{(y - \mu^{e}_{0})^{2} - 2{\sqrt{\pi_{{\rm samples}}}}\,(x - \mu^{e}_{1})(y - \mu^{e}_{0}) + (x - \mu_{1}^{e})^{2}}{2(1 - \pi_{{\rm samples}})}\right) dx\,dy, \end{aligned}$$

where

$$\mu_{0}^{e} = \frac{{\sqrt {2N}}(S^{A} - S^{U})}{\sqrt {P^{A} (1 - P^{A})(1 + \log P^{A})^{2} + P^{U} (1 - P^{U})(1 + \log P^{U})^{2}}}.$$
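The combination rule for \(Z^{e}_{{\rm joint}}\) and a search for the conditional critical value can be sketched as follows. The function names are ours; the Monte Carlo threshold search is only feasible for loose illustrative targets, whereas the paper's realistic target \(\alpha_{{\rm genome}}/(M\pi_{{\rm markers}})\) is far too small for simulation and requires the iterative numerical integration described above:

```python
import math
import random
from statistics import NormalDist

def joint_statistic(z1, z2, pi_samples):
    """Z_joint = sqrt(pi_samples) * Z1 + sqrt(1 - pi_samples) * Z2.
    If Z1 and Z2 are independent N(0, 1), Z_joint is again N(0, 1)."""
    return math.sqrt(pi_samples) * z1 + math.sqrt(1 - pi_samples) * z2

def joint_threshold(pi_samples, pi_markers, target, reps=200_000, seed=0):
    """Monte Carlo approximation of C_joint satisfying
    P0(|Z_joint| > C_joint | |Z1| > C1) = target (toy targets only)."""
    rng = random.Random(seed)
    c1 = NormalDist().inv_cdf(1 - pi_markers / 2)
    kept = []
    for _ in range(reps):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        if abs(z1) > c1:  # marker passes stage 1 under H0
            kept.append(abs(joint_statistic(z1, z2, pi_samples)))
    kept.sort()
    return kept[int(len(kept) * (1 - target))]
```

The weights sqrt(pi_samples) and sqrt(1 − pi_samples) are what make the null variance of Z_joint equal to 1 while weighting each stage by its share of the sample.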

The power of entropy-based joint analysis for two-stage genome-wide association studies is

$$\int\limits_{|x| > C^{e}_{1}} \int\limits_{{\left| y \right|} > C^{e}_{{\rm joint}}} \frac{1}{2\pi {\sqrt {1 - \pi_{{\rm samples}}}}}\exp \left(- \frac{(y - \mu^{e}_{0})^{2} - 2{\sqrt{\pi_{{\rm samples}}}}\,(x - \mu^{e}_{1})(y - \mu^{e}_{0}) + (x - \mu^{e}_{1})^{2}}{2(1 - \pi_{{\rm samples}})}\right) dx\,dy.$$
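Rather than evaluating this double integral directly, the two-stage power can be approximated by simulating the pair (Z 1, Z joint) from its conditional structure: Z 1 ∼ N(μ 1, 1) and, given Z 1 = x, Z joint ∼ N(μ 0 + √πsamples (x − μ 1), 1 − πsamples). A Monte Carlo sketch (our own helper, not the paper's method of computation):

```python
import math
import random

def joint_power_mc(mu0, mu1, pi_samples, c1, c_joint, reps=100_000, seed=0):
    """Monte Carlo estimate of the two-stage power
    P(|Z1| > C1, |Z_joint| > C_joint) under the alternative.

    mu1: noncentrality of the stage 1 statistic (stage 1 samples only).
    mu0: noncentrality of the statistic computed on all samples.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        x = rng.gauss(mu1, 1.0)
        if abs(x) <= c1:
            continue  # marker not selected in stage 1
        y = rng.gauss(mu0 + math.sqrt(pi_samples) * (x - mu1),
                      math.sqrt(1 - pi_samples))
        if abs(y) > c_joint:
            hits += 1
    return hits / reps
```

Setting mu0 = mu1 = 0 recovers the per-marker false-positive rate of the two-stage procedure, which is how the thresholds can be sanity-checked.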

Disease models

Denote genotype relative risks (GRRs) to be R 1 and R 2, the disease prevalence to be Prev, and the risk allele frequency to be P d. Then the disease penetrances \(f_{i} = P(\hbox{affected}|i\;\hbox{copies of}\;d\;\hbox{allele})\; (0 \leq i \leq 2)\) are

$$f_{0} = {\it Prev}/((1 - P_{d})^{2} + 2P_{d} (1 - P_{d})R_{1} + P^{2}_{d} R_{2}), \quad f_{1} = f_{0} R_{1}, \enspace f_{2} = f_{0} R_{2}.$$

Therefore

$$P(dd|\hbox{Cases}) = \frac{f_{2} P^{2}_{d}}{{\it Prev}},$$
$$P(Dd|\hbox{Cases}) = \frac{2f_{1} (1 - P_{d})P_{d}}{{\it Prev}},$$
$$P(DD|\hbox{Cases}) = \frac{f_{0} (1 - P_{d})^{2}}{{\it Prev}};$$
$$P(dd|\hbox{Controls}) = \frac{(1 - f_{2})P^{2}_{d}}{1 - {\it Prev}},$$
$$P(Dd|\hbox{Controls}) = \frac{2(1 - f_{1})(1 - P_{d})P_{d}}{1 - {\it Prev}},$$
$$P(DD|\hbox{Controls}) = \frac{(1 - f_{0})(1 - P_{d})^{2}}{1 - {\it Prev}}.$$

The risk allele frequencies at a marker locus in cases and controls can therefore be obtained, respectively, as

$$\frac{f_{2} P^{2}_{d}}{{\it Prev}} + \frac{f_{1} (1 - P_{d})P_{d}}{{\it Prev}}\enspace\hbox{and}\enspace\frac{(1 - f_{2})P^{2}_{d}}{1 - {\it Prev}} + \frac{(1 - f_{1})(1 - P_{d})P_{d}}{1 - {\it Prev}}.$$
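The disease-model calculations above can be collected into a single routine that maps the model parameters to the case and control allele frequencies. A sketch (the function name is ours; Hardy–Weinberg proportions in the population are assumed, as in the text):

```python
def case_control_freqs(r1, r2, prev, p_d):
    """Risk allele frequencies (P^A, P^U) in cases and controls, given
    genotype relative risks R1 and R2, disease prevalence Prev, and
    population risk allele frequency P_d (HWE assumed)."""
    # Baseline penetrance f0 so that the model reproduces the prevalence.
    f0 = prev / ((1 - p_d) ** 2 + 2 * p_d * (1 - p_d) * r1 + p_d ** 2 * r2)
    f1, f2 = f0 * r1, f0 * r2  # penetrances of Dd and dd
    # P(d | case) = P(dd | case) + (1/2) P(Dd | case), and likewise for controls.
    p_case = (f2 * p_d ** 2 + f1 * (1 - p_d) * p_d) / prev
    p_ctrl = ((1 - f2) * p_d ** 2 + (1 - f1) * (1 - p_d) * p_d) / (1 - prev)
    return p_case, p_ctrl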

Results

In order to apply the entropy-based statistic to genome-wide association studies, we first examined its properties in the simplest setting, a single-locus case-control association study. In the Methods, we showed that when the sample size is large enough for large-sample theory to apply, the distribution of the entropy-based statistic under the null hypothesis of no association is asymptotically normal. To examine whether this asymptotic result still holds for a small sample size, 200 individuals were randomly generated in each of 10,000 simulations, and the entropy-based test statistic Z e was calculated in each simulation.

Table 1 summarizes the estimated type I error rates of the test statistic Z e for sample sizes from 100 to 500 individuals in the association test. The estimated type I error rates of Z e are not appreciably different from the nominal levels α = 0.05, α = 0.01, and α = 0.005. Table 2 summarizes the power of the entropy-based statistic in single-locus association studies for sample sizes from 100 to 500 individuals, using a multiplicative model with R 1 = 1.60 and R 2 = 2.56 and a disease prevalence of 0.10. The power of the entropy-based test is higher than that of the test using a linear function of risk allele frequencies.
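The type I error simulation can be sketched as follows; this is our own illustrative reconstruction (function name and simulation details are assumptions), drawing case and control alleles from the same frequency under the null and counting rejections of the entropy-based statistic:

```python
import math
import random
from statistics import NormalDist

def simulate_type1(p, n, alpha, reps=2000, seed=1):
    """Monte Carlo check of the entropy statistic's type I error rate.

    Under H0, 2n case alleles and 2n control alleles are drawn with the
    same risk allele frequency p; we count rejections at |Z^e| > z_crit.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(reps):
        pa = sum(rng.random() < p for _ in range(2 * n)) / (2 * n)
        pu = sum(rng.random() < p for _ in range(2 * n)) / (2 * n)
        if pa in (0.0, 1.0) or pu in (0.0, 1.0):
            continue  # entropy undefined at the boundary; rare when n*p >> 1
        sa, su = -pa * math.log(pa), -pu * math.log(pu)
        var = (pa * (1 - pa) * (1 + math.log(pa)) ** 2
               + pu * (1 - pu) * (1 + math.log(pu)) ** 2) / (2 * n)
        if abs(sa - su) / math.sqrt(var) > z_crit:
            hits += 1
    return hits / reps
```

An empirical rejection rate close to the nominal alpha indicates that the normal approximation for Z e is adequate at the given sample size.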

Table 1 Estimated type I error rates for the test statistic Z e in single-locus association tests (10,000 simulations)
Table 2 Power of the test statistic Z e in single-locus association study (10,000 simulations)

Now we apply the new statistic to joint analysis for two-stage genome-wide association studies. First, we compare the power of one-stage analysis, linear joint analysis, and entropy-based nonlinear joint analysis at αgenome = 0.05 over a wide range of proportions \(\pi_{{\rm samples}}\) of samples in stage 1 and proportions \(\pi_{{\rm markers}}\) of markers selected for follow-up genotyping in stage 2, under four different genetic models with risk allele frequencies of 0.05 and 0.10. All the results show that entropy-based nonlinear joint analysis is more powerful and a more efficient design for genome-wide association studies (Figs. 1, 2).

Fig. 1

Power of linear and entropy-based joint analyses with 2,000 cases and 2,000 controls genotyped on 300,000 independent markers with αgenome = 0.05. Using a multiplicative genetic model with R 1 = 1.40 and R 2 = 1.96 and disease prevalence of 0.10

Fig. 2

Power of linear and entropy-based joint analyses with 2,000 cases and 2,000 controls genotyped on 300,000 independent markers with αgenome = 0.05 under four different genetic models and disease prevalence of 0.10. Dominant model: R 1 = 1.60, R 2 = 1.60; recessive model: R 1 = 1, R 2 = 6; multiplicative model: R 1 = 1.60, R 2 = 2.56; additive model: R 1 = 1.50, R 2 = 2

We then investigate the power of nonlinear joint analysis as a function of both the risk allele frequency under multiplicative genetic models (Fig. 3) and the proportion \(\pi_{{\rm markers}}\) of markers selected for follow-up in stage 2 (Fig. 4). We find that the power of the entropy-based joint analysis is consistently higher than that of the linear joint analysis when the risk allele frequency is small. However, as the risk allele frequency increases, the powers of the two joint analyses become comparable.

Fig. 3

Power of linear and entropy-based joint analyses as a function of the frequencies of risk allele with 2,000 cases and 2,000 controls genotyped on 300,000 markers with αgenome = 0.05; it uses two multiplicative genetic models (R 1 =1.50, R 2 = 2.25  and  R 1 = 1.70, R 2 = 2.89) and disease prevalence of 0.10

Fig. 4

Power of linear and entropy-based joint analyses as a function of \(\pi_{{\rm markers}}\) with 2,000 cases and 2,000 controls genotyped on 300,000 markers with αgenome = 0.05; it uses dominant (R 1 = 1.60, R 2 = 1.60) and multiplicative (R 1 = 1.40, R 2 = 1.96) genetic models with disease prevalence of 0.10 and risk allele frequency of 0.10

We also investigate the sample sizes needed to detect genetic variants with different effect sizes (Fig. 5) using linear and entropy-based joint analyses in a two-stage design. The sample size needed for entropy-based joint analysis is smaller than that needed for linear joint analysis to attain the same power. To obtain a power of 80% for a genetic variant with P d = 0.10 under a multiplicative model, the required sample sizes are 2,540 and 1,227 for effect sizes of GRR = 1.4 and GRR = 1.6, respectively.

Fig. 5

Power of linear and entropy-based joint analyses with variant sample sizes on 300,000 independent markers with αgenome =  0.05. It uses two multiplicative models with R 1 = 1.40, R 2 = 1.96 and  R 1 = 1.60, R 2 = 2.56 respectively and prevalence = 0.10, risk allele frequency of 0.10

When controlling the false discovery rate, we compare the power of linear and entropy-based joint analyses in two-stage genome-wide association studies as a function of the difference in risk allele frequencies between cases and controls (Fig. 6). The power of the entropy-based joint analysis is higher than that of the linear joint analysis at the same false discovery rate when detecting genetic variants with a small frequency. Equivalently, if the two joint analyses are required to attain the same power, the false-positive rate of the linear joint analysis must increase. For example, the false-positive rate increases from 0.05 to nearly 0.10 when the two joint analyses achieve the same power for \(\pi_{{\rm samples}} = 0.30, \pi_{{\rm markers}} = 0.01,\) GRR = 1.60, and P A − P U = 0.04.

Fig. 6

Power of linear and entropy-based joint analyses under controlling the same false discovery rate, with sample size 2,000 cases and 2,000 controls on 300,000 independent markers. It uses dominant (R 1 = 1.60, R 2 = 1.60) and multiplicative (R 1 = 1.60, R 2 = 2.56) models, prevalence = 0.10, πsamples = 0.30 and \(\pi_{{\rm markers}}=0.01\)

In Table 3, we compare the power of the entropy-based joint analysis with that of the linear joint analysis under four different genetic models. The simulations show that the power of the entropy-based joint analysis is about 2% higher than that of the linear joint analysis at a risk allele frequency of 0.05. In Table 4, we evaluate the sample size needed for the entropy-based joint analysis; fewer samples are needed than for the linear joint analysis. In Table 5, we compare the power of the linear and entropy-based joint analyses when controlling the false discovery rate for a fixed allele frequency difference between cases and controls. The false discovery rate of the linear joint analysis increases from 0.05 to 0.1 when it is required to match the power of 0.93 attained by the entropy-based joint analysis. All results show that the entropy-based analysis is more powerful, needing fewer samples to attain the same power while achieving the same false discovery rate. This makes sense, as the entropy-based joint analysis uses a nonlinear function of the risk allele frequencies and thus makes fuller use of the information in all samples.

Table 3 Power of entropy-based joint analyses for two-stage genome-wide association studies under four genetic models
Table 4 Sample size to attain the desired significance level of two-stage genome-wide design 0.05 and power of 80% for various rare allele frequency differences and population allele frequencies
Table 5 Power of entropy-based joint analyses for two-stage genome-wide association studies when controlling FDR

Discussion

We have shown that entropy-based joint analysis for the two-stage genome-wide association design is a more efficient and more powerful strategy for identifying genetic variants of varying effect sizes associated with a disease when testing a large number of markers using unrelated case-control samples. To achieve an overall power of 90% when detecting genetic variants with small frequency and small to large effects, the sample size needed for the entropy-based joint analysis is about 30 fewer than that needed for the linear joint analysis.

Genome-wide disease-association mapping has been heralded as the study design of the next generation (Marchini et al. 2005). Two-stage designs are a promising strategy for genome-wide association studies, but the lack of analytical methods that use the genotype data fully remains a major stumbling block (Lin et al. 2004). We should therefore commit ourselves to finding more powerful and more efficient methods and statistics. The traditional test statistic of Skol et al. (2006) is a linear function of the difference (P A − P U) in risk allele frequencies between cases and controls. Here, we introduce a nonlinear function of the risk allele frequencies in cases and controls, the entropy difference (S A − S U), to develop a novel test statistic with high power for detecting the genetic variants underlying disease.

We investigate the distribution of the nonlinear entropy-based statistic under the null hypothesis by simulation. To validate the test statistic, we calculate its type I error rates by simulation; they are close to the nominal significance levels. To evaluate the performance of the entropy-based analysis, we compare the power of the entropy-based statistic in single-locus association studies with that of the statistic using a linear function of risk allele frequencies in cases and controls. The results show that the entropy-based statistic has higher power. However, since the power of a statistic is a complex issue, no single statistic is uniformly most powerful (Zhao et al. 2006), and the entropy-based analysis is not more powerful in all situations. When the difference in rare risk allele frequencies between cases and controls is large, that is, |P A − P U| > 0.07, the linear joint analysis is more powerful than the entropy-based joint analysis for detecting rare genetic variants with varying genetic effects. However, such large differences in rare risk allele frequencies between cases and controls are practically unrealistic in real-world studies of rare variants and common diseases.

Subsequently, we apply the entropy-based statistic to two-stage genome-wide association studies. We compare the power of the entropy-based nonlinear joint analysis with that of the linear joint analysis by simulation. The results show that the power of the entropy-based joint analysis is higher in most cases when detecting rare genetic variants with varying genetic effects. Entropy is, however, only one nonlinear transformation of the risk allele frequencies; general nonlinear transformations f(P A, P U) of the risk allele frequencies in cases and controls should be investigated in the future.

Here we have described entropy-based joint analysis for two-stage genome-wide association studies using independent genetic markers, but this assumption is violated when some markers are in linkage disequilibrium. Moreover, for genetic variants that each have a small marginal effect but contribute modest or large effects in combination, interactions between loci should be considered in genome-wide association studies. This will be an inevitable and promising field for genome-wide association studies.

The simulations show that, for a given sample size, one should genotype half of the individuals on all markers in the first stage and select 5% of the markers for follow-up genotyping in the second stage when using the entropy-based statistic; this provides a practical, cost-effective strategy for detecting rare genetic variants in association studies. The simulations also show that, when searching for rare genetic variants with moderate effects (R 1 = 1.4, 1.6), a sample size of approximately 2,000 suffices for a fixed rare allele frequency difference of 4–5% using the entropy-based joint analysis.

In multiple testing, there is an increasing trend toward using the false discovery rate as a measure of global error instead of the overall type I error rate. This article compares the power of the entropy-based joint analysis with that of the linear joint analysis when both control the false discovery rate at the same level, as is usually done in the literature (Benjamini and Hochberg 1995; Zou and Zuo 2006; Zuo et al. 2006). The results show that the entropy-based joint analysis achieves higher power than the linear joint analysis under the same false discovery rate.

In conclusion, numerous genome-wide association studies for a range of diseases are being planned or are already underway. New statistical methods that can deal with such large-scale studies are urgently needed to explore the etiology of complex diseases. Two-stage designs are more efficient than, and nearly as powerful as, the one-stage design. The results in this paper show that entropy-based joint analyses are more powerful and need fewer samples to attain the same power while achieving the same false discovery rate. We therefore suggest using entropy-based joint analysis for two-stage genome-wide association studies.