Introduction

Genome-wide association studies, first proposed a decade ago by Risch and Merikangas (1996), are now being conducted to unravel the genetic etiology of complex human diseases (Klein et al. 2005; Thomas et al. 2005), enabled by rapidly decreasing genotyping costs, massively high-throughput genotyping technologies, and the large-scale SNP discovery and genotyping efforts of the SNP Consortium (Sachidanandam et al. 2001) and the International HapMap Consortium (Hinds et al. 2005). At present, the two-stage design is a more efficient method for genome-wide association studies than the one-stage design.

In a one-stage design, all available samples are genotyped on all markers. In a replication-based two-stage analysis, a dense set of SNP markers across the genome is genotyped and tested on a portion of the samples in stage 1, and the most promising markers are then genotyped and tested in the remaining samples in stage 2. In contrast, a joint analysis for two-stage genome-wide association studies examines the most promising markers identified in stage 1 using the samples from both stages. Joint analysis is therefore more efficient: its power is nearly the same as that of the one-stage design, while genotyping costs are substantially reduced (Satagopan et al. 2002, 2004; Satagopan and Elston 2003; Thomas et al. 2004; Skol et al. 2006). The crucial task for two-stage genome-wide association analysis is now to construct more powerful test statistics that make full use of the available data and to develop more efficient methods.

The power of two-stage genome-wide association studies to identify variants underlying a complex disease depends on a number of factors, including how the M markers are selected, the total sample size N, the proportion \(\pi_{{\rm samples}}\) of samples genotyped in stage 1, the proportion \(\pi_{{\rm markers}}\) of markers tested in stage 2, the strategy used to test for association, the underlying disease model, the effect size of the risk allele, and the disease prevalence (Skol et al. 2006). Thus, for a given set of design parameters, the relationships among \(\pi_{{\rm samples}}, \pi_{{\rm markers}}\), and N need to be determined in order to attain high power while controlling the genome-wide type I error rate.

Skol et al. (2006) showed that, for two-stage genome-wide association studies, joint analysis using a linear function of the risk allele frequencies in cases and controls is more powerful than replication-based analysis. (This joint analysis of Skol et al. will be referred to as the “linear joint analysis” throughout this article.) The gain arises because, in the second stage, linear joint analysis examines the data from both stages 1 and 2 rather than from stage 2 alone. The statistic of Skol et al. (2006), however, compares risk allele frequencies in cases and controls through a linear function of those frequencies. Shannon entropy provides a nonlinear function of the risk allele frequencies. Shannon entropy, originally defined in statistical physics and information theory (Cover and Thomas 1991; Greiner et al. 1995), measures the uncertainty removed, or the information gained, by performing an experiment. When applied to characterize DNA variation, entropy measures genetic diversity and extracts the maximal amount of information from a set of SNP markers (Hampe et al. 2003). The difference in the entropy of SNP markers between cases and controls is a measure of the association of the markers with disease (Zhao et al. 2005). In this paper we propose an entropy-based statistic with high power for joint analysis in two-stage genome-wide association studies.

The main purpose of this article is to develop a powerful entropy-based statistic, built on a nonlinear transformation of risk allele frequencies, for joint analysis in two-stage genome-wide association studies. We compare the power of one-stage analysis, linear joint analysis, and entropy-based nonlinear joint analysis by simulation. To demonstrate that amplifying the differences in allele frequencies through a nonlinear test statistic does not cause false-positive problems, we investigate the type I error rates of the entropy-based statistic in single-locus association tests by simulation. Finally, we compare the power of linear joint analysis with that of entropy-based joint analysis when the same false discovery rate is controlled. Based on these results, we recommend entropy-based joint analysis for genome-wide association studies.

Methods

We consider evaluating M markers using a case-control design, where the N cases and N controls are all unrelated individuals. Further, we assume that the markers are not in strong linkage disequilibrium with each other, so that they can be considered independent, and that the alleles at each locus are in Hardy–Weinberg equilibrium (HWE). We test every marker in a proportion \(\pi_{{\rm samples}}\) of the samples in stage 1 and select approximately \(M \times \pi_{{\rm markers}}\) markers for genotyping on the remaining N × (1 − πsamples) cases and controls in stage 2.

Denote by Z 1 and Z 2 the test statistics of a marker at stage 1 and stage 2, and by C 1 and C 2 the critical values for stage 1 and stage 2, respectively. We denote by P 0 (·) and P A (·) the probabilities of an event under the null and alternative hypotheses, respectively. Then, from Satagopan et al. (2002, 2003, 2004), the false-positive rate for a marker under a two-stage strategy is

$$\alpha_{{\rm markers}} = \alpha_{{\rm genome}} /M = P_{0} (Z_{1} > C_{1})P_{0} (Z_{2} > C_{2} |Z_{1} > C_{1}),$$

where αgenome is any desired genome-wide false-positive rate (type I error rate). The power of the two-stage strategy is the probability of selecting a disease locus under the alternative hypothesis, which is

$$\hbox{Power} = P_{A} (Z_{1} > C_{1}, Z_{2} > C_{2}) = P_{A} (Z_{1} > C_{1})P_{A} (Z_{2} > C_{2} |Z_{1} > C_{1}).$$

Denote by \(\hat{P}^{A}_{1}\) and \(\hat{P}^{U}_{1}\) the estimated risk allele frequencies in cases and controls in stage 1, respectively. The test statistic of Skol et al. (2006) is defined as

$$Z_{1} = \frac{\hat{P}_{1}^{A} - \hat{P}_{1}^{U}}{\sqrt {(\hat{P}_{1}^{A} (1 - \hat{P}_{1}^{A}) + \hat{P}_{1}^{U} (1 - \hat{P}_{1}^{U}))/(2N\pi_{{\rm samples}})}}.$$

Under the null hypothesis of no association, and when a large number of samples Nπsamples is genotyped in stage 1, Z 1 follows a normal distribution with mean 0 and variance 1. We can determine a threshold C 1 for selecting markers for follow-up such that \(P(|Z_{1} | > C_{1}) = \pi_{{\rm markers}}.\) Under the alternative hypothesis, the statistic Z 1 in large samples follows an approximate normal distribution with mean

$$\mu_{1} = \frac{P^{A} - P^{U}}{\sqrt {(P^{A} (1 - P^{A}) + P^{U} (1 - P^{U}))/(2N\pi_{{\rm samples}})}}$$

and variance 1, where P A and P U are the risk allele frequencies in cases and controls.
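As an illustration, the stage 1 statistic above can be computed directly from the estimated allele frequencies. The following is a minimal sketch in Python; the function name `linear_z` is ours, not from Skol et al. (2006):

```python
import math

def linear_z(p_hat_a, p_hat_u, n_stage1):
    """Allele-frequency difference statistic Z1 (linear form).

    p_hat_a, p_hat_u: estimated risk allele frequencies in cases and controls.
    n_stage1: number of cases (= number of controls) genotyped in stage 1,
              i.e. N * pi_samples in the paper's notation; 2 * n_stage1 is
              the number of alleles sampled per group.
    """
    var = (p_hat_a * (1 - p_hat_a) + p_hat_u * (1 - p_hat_u)) / (2 * n_stage1)
    return (p_hat_a - p_hat_u) / math.sqrt(var)
```

Under the null hypothesis (equal frequencies), the statistic is 0 by construction, and it grows with both the frequency difference and the stage 1 sample size.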

Entropy-based joint analysis

In statistical physics and information theory, entropy measures the uncertainty of random variables or the degree of non-structure within a system (Cover and Thomas 1991; Greiner et al. 1995). The entropy of a discrete variable or a system X is defined as:

$$S(X) = - {\sum\limits_i {p(x_{i})}}\log p(x_{i}),$$

where p(x i) = Prob(X = x i). Entropy can be used to measure DNA variations at disease genes underlying a complex disease (Ackerman et al. 2003; Hampe et al. 2003; Zhao et al. 2005).
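The definition of S(X) translates directly into code. A minimal sketch (natural logarithm, matching the log used throughout this paper; terms with zero probability contribute nothing and are skipped):

```python
import math

def shannon_entropy(probs):
    """Shannon entropy S(X) = -sum_i p(x_i) log p(x_i), natural log.

    probs: the probabilities p(x_i) of a discrete random variable.
    Zero-probability outcomes are skipped, since p log p -> 0 as p -> 0.
    """
    return -sum(p * math.log(p) for p in probs if p > 0)
```

For example, a fair binary variable has entropy log 2, the maximum for two outcomes, while a degenerate variable has entropy 0.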

The entropies of the risk allele at one marker in cases and controls are defined as S A = − P A log P A and S U = − P U log P U, respectively.

Then the new entropy-based test statistic for an association test is defined as:

$$Z^{e} = \frac{\hat{S}^{A} - \hat{S}^{U}}{\sqrt{ \frac{{{\hat{P}^{A}} (1 - \hat{P}^{A})(1 + \log \hat{P}^{A})^{2}}}{2N^{A}} + \frac{{{\hat{P}^{U}} (1 - \hat{P}^{U})(1 + \log \hat{P}^{U})^{2}}}{2N^{U}}}},$$

where \(\hat{S}^{A},\hat{S}^{U},\hat{P}^{A}, \hat{P}^{U}\) are the estimators of S A, S U, P A, and P U, respectively.

From Theorem 1.9 of Lehmann (1983), the statistic Z e is asymptotically distributed as a normal distribution with mean 0 and variance 1 under the null hypothesis of no association. Under the alternative hypothesis of association, Z e is asymptotically distributed as a normal distribution with mean

$$\mu^{e} = \frac{S^{A} - {S}^{U}}{\sqrt{\frac{{P^{A} (1 -{P}^{A})(1 + \log{P}^{A})^{2}}}{2N^{A}} + \frac{{P^{U} (1 -{P}^{U})(1 + \log{ P}^{U})^{2}}}{2N^{U}}}},$$

and variance 1.
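The statistic Z e can be evaluated as follows. This is a sketch under our own naming (`entropy_z` is not from the paper); the denominator is the delta-method standard error of \(\hat{S}^{A} - \hat{S}^{U}\), using d(−p log p)/dp = −(1 + log p):

```python
import math

def entropy_z(p_hat_a, p_hat_u, n_cases, n_controls):
    """Entropy-based statistic Z^e comparing S^A = -P^A log P^A with S^U.

    n_cases / n_controls: numbers of case / control individuals, so each
    group contributes 2 * n alleles (hence the 2N terms in the variance).
    """
    s_a = -p_hat_a * math.log(p_hat_a)
    s_u = -p_hat_u * math.log(p_hat_u)
    var = (p_hat_a * (1 - p_hat_a) * (1 + math.log(p_hat_a)) ** 2 / (2 * n_cases)
           + p_hat_u * (1 - p_hat_u) * (1 + math.log(p_hat_u)) ** 2 / (2 * n_controls))
    return (s_a - s_u) / math.sqrt(var)
```

As with the linear statistic, equal estimated frequencies give Z e = 0, and larger frequency differences at rare alleles are amplified by the nonlinear entropy transform.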

In stage 1, when the numbers of cases and controls genotyped are equal, \(N^{A} = N^{U} = N\pi_{{\rm samples}}\), the statistic for entropy-based joint analysis is

$$Z^{e}_{1} = \frac{\sqrt {2N\pi_{{\rm samples}}}(\hat{S}^{A} - \hat{S}^{U})}{\sqrt {\hat{P}^{A} (1 - \hat{P}^{A})(1 + \log \hat{P}^{A})^{2} + \hat{P}^{U} (1 - \hat{P}^{U})(1 + \log \hat{P}^{U})^{2}}}.$$

Under the null hypothesis of no association, and when a large number of samples Nπsamples is genotyped in stage 1, Z e1 follows a normal distribution with mean 0 and variance 1. The threshold \(C_{1}^{e} = \Phi^{{- 1}} (1 - \pi_{{\rm markers}} /2)\) is determined for selecting markers for follow-up genotyping. So the probability that a marker will be selected for stage 2 genotyping is

$$P_{1}^{e} = 1 - \Phi (C_{1}^{e} - \mu_{1}^{e}) + \Phi (- C_{1}^{e} - \mu_{1}^{e}),$$

where \(\mu^{e}_{1} = \frac{{{\sqrt {2N\pi_{{samples}}}}(S^{A} - S^{U})}}{{{\sqrt {P^{A} (1 - P^{A})(1 + \log P^{A})^{2} + P^{U} (1 - P^{U})(1 + \log P^{U})^{2}}}}}.\) Similarly, an analogous statistic Z e2 is calculated using genotype data only from stage 2.
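The stage 1 threshold and the selection probability above can be computed with the standard normal quantile and distribution functions. A sketch using the Python standard library (function names are ours):

```python
from statistics import NormalDist

_std = NormalDist()  # standard normal N(0, 1)

def stage1_threshold(pi_markers):
    """Two-sided threshold C1 such that P(|Z1| > C1) = pi_markers under H0,
    i.e. C1 = Phi^{-1}(1 - pi_markers / 2)."""
    return _std.inv_cdf(1 - pi_markers / 2)

def selection_prob(c1, mu1):
    """P(|Z1| > C1) when Z1 ~ N(mu1, 1):
    1 - Phi(C1 - mu1) + Phi(-C1 - mu1)."""
    return 1 - _std.cdf(c1 - mu1) + _std.cdf(-c1 - mu1)
```

With mu1 = 0 this recovers pi_markers exactly (the null selection rate), while a nonzero noncentrality mu1 raises the probability that a disease marker survives to stage 2.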

In stage 2, the entropy-based statistic \(Z^{e}_{{\rm joint}}\) is calculated using genotype data from both stages 1 and 2 as

$$Z_{{\rm joint}}^{e} = {\sqrt {\pi_{{\rm samples}}}}Z_{1}^{e} + {\sqrt {1 - \pi_{{\rm samples}}}}Z_{2}^{e}.$$

Conditional on the observed stage 1 statistic Z e1 = x, the statistic for joint analysis \(Z^{e}_{{\rm joint}}\) follows an approximate normal distribution in large samples with mean

$$\mu_{{\rm joint}}^{e} = \frac{{\sqrt {2N}}(S^{A} - S^{U})}{\sqrt {P^{A} (1 - P^{A})(1 + \log P^{A})^{2} + P^{U} (1 - P^{U})(1 + \log P^{U})^{2}}} + {\sqrt {\pi_{{\rm samples}}}}(x - \mu^{e}_{1})$$

and variance 1 − πsamples. Under the null hypothesis of no association, \(\mu^{e}_{{\rm joint}} = {\sqrt {\pi_{{\rm samples}}}}x.\) The critical value \(C^{e}_{{\rm joint}}\) can be calculated iteratively by finding the threshold that satisfies \(P_{0} ({\left| {Z^{e}_{{\rm joint}}} \right|} > C^{e}_{{\rm joint}} \mid {\left| {Z^{e}_{1}} \right|} > C^{e}_{1}) = \alpha_{{\rm genome}} / (M\pi_{{\rm markers}}).\) The probability of detecting association in stage 2 in an entropy-based joint analysis is

$$\begin{aligned} P_{{\rm joint}}^{e} &= P_{A} ({\left| {Z_{{\rm joint}}^{e}} \right|} > C_{{\rm joint}}^{e} \mid {\left| {Z_{1}^{e}} \right|} > C_{1}^{e}) \\ &= \int_{C_{1}^{e}}^{\infty} P_{A} ({\left| {Z_{{\rm joint}}^{e}} \right|} > C_{{\rm joint}}^{e} \mid Z_{1}^{e} = x)\, f(x \mid {\left| {Z_{1}^{e}} \right|} > C_{1}^{e})\,dx \\ &\quad + \int_{-\infty}^{-C_{1}^{e}} P_{A} ({\left| {Z_{{\rm joint}}^{e}} \right|} > C_{{\rm joint}}^{e} \mid Z_{1}^{e} = x)\, f(x \mid {\left| {Z_{1}^{e}} \right|} > C_{1}^{e})\,dx \\ &= \int_{|x| > C_{1}^{e}} \int_{{\left| y \right|} > C_{{\rm joint}}^{e}} \frac{1}{2\pi {\sqrt {1 - \pi_{{\rm samples}}}}\,[1 - \Phi (C_{1}^{e} - \mu_{1}^{e}) + \Phi (- C_{1}^{e} - \mu^{e}_{1})]} \\ &\qquad \times \exp \left(- \frac{(y - \mu^{e}_{0})^{2} - 2{\sqrt{\pi_{{\rm samples}}}}\,(x - \mu^{e}_{1})(y - \mu^{e}_{0}) + (x - \mu_{1}^{e})^{2}}{2(1 - \pi_{{\rm samples}})}\right) dx\,dy, \end{aligned}$$

where

$$\mu_{0}^{e} = \frac{{\sqrt {2N}}(S^{A} - S^{U})}{\sqrt {P^{A} (1 - P^{A})(1 + \log P^{A})^{2} + P^{U} (1 - P^{U})(1 + \log P^{U})^{2}}}.$$
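The combination rule for \(Z^{e}_{{\rm joint}}\) and a search for the conditional critical value can be sketched as follows. The function names are ours; the Monte Carlo threshold search is only feasible for loose illustrative targets, whereas the paper's realistic target \(\alpha_{{\rm genome}}/(M\pi_{{\rm markers}})\) is far too small for simulation and requires the iterative numerical integration described above:

```python
import math
import random
from statistics import NormalDist

def joint_statistic(z1, z2, pi_samples):
    """Z_joint = sqrt(pi_samples) * Z1 + sqrt(1 - pi_samples) * Z2.
    If Z1 and Z2 are independent N(0, 1), Z_joint is again N(0, 1)."""
    return math.sqrt(pi_samples) * z1 + math.sqrt(1 - pi_samples) * z2

def joint_threshold(pi_samples, pi_markers, target, reps=200_000, seed=0):
    """Monte Carlo approximation of C_joint satisfying
    P0(|Z_joint| > C_joint | |Z1| > C1) = target (toy targets only)."""
    rng = random.Random(seed)
    c1 = NormalDist().inv_cdf(1 - pi_markers / 2)
    kept = []
    for _ in range(reps):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        if abs(z1) > c1:  # marker passes stage 1 under H0
            kept.append(abs(joint_statistic(z1, z2, pi_samples)))
    kept.sort()
    return kept[int(len(kept) * (1 - target))]
```

The weights sqrt(pi_samples) and sqrt(1 − pi_samples) are what make the null variance of Z_joint equal to 1 while weighting each stage by its share of the sample.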

The power of entropy-based joint analysis for two-stage genome-wide association studies is

$$\int\limits_{|x| > C^{e}_{1}} \int\limits_{{\left| y \right|} > C^{e}_{{\rm joint}}} \frac{1}{2\pi {\sqrt {1 - \pi_{{\rm samples}}}}}\exp \left(- \frac{(y - \mu^{e}_{0})^{2} - 2{\sqrt{\pi_{{\rm samples}}}}\,(x - \mu^{e}_{1})(y - \mu^{e}_{0}) + (x - \mu^{e}_{1})^{2}}{2(1 - \pi_{{\rm samples}})}\right) dx\,dy.$$
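Rather than evaluating this double integral directly, the two-stage power can be approximated by simulating the pair (Z 1, Z joint) from its conditional structure: Z 1 ∼ N(μ 1, 1) and, given Z 1 = x, Z joint ∼ N(μ 0 + √πsamples (x − μ 1), 1 − πsamples). A Monte Carlo sketch (our own helper, not the paper's method of computation):

```python
import math
import random

def joint_power_mc(mu0, mu1, pi_samples, c1, c_joint, reps=100_000, seed=0):
    """Monte Carlo estimate of the two-stage power
    P(|Z1| > C1, |Z_joint| > C_joint) under the alternative.

    mu1: noncentrality of the stage 1 statistic (stage 1 samples only).
    mu0: noncentrality of the statistic computed on all samples.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        x = rng.gauss(mu1, 1.0)
        if abs(x) <= c1:
            continue  # marker not selected in stage 1
        y = rng.gauss(mu0 + math.sqrt(pi_samples) * (x - mu1),
                      math.sqrt(1 - pi_samples))
        if abs(y) > c_joint:
            hits += 1
    return hits / reps
```

Setting mu0 = mu1 = 0 recovers the per-marker false-positive rate of the two-stage procedure, which is how the thresholds can be sanity-checked.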

Disease models

Denote genotype relative risks (GRRs) to be R 1 and R 2, the disease prevalence to be Prev, and the risk allele frequency to be P d. Then the disease penetrances \(f_{i} = P(\hbox{affected}|i\;\hbox{copies of}\;d\;\hbox{allele})\; (0 \leq i \leq 2)\) are

$$f_{0} = {\it Prev}/((1 - P_{d})^{2} + 2P_{d} (1 - P_{d})R_{1} + P^{2}_{d} R_{2}), \quad f_{1} = f_{0} R_{1}, \enspace f_{2} = f_{0} R_{2}.$$

Therefore

$$P(dd|\hbox{Cases}) = \frac{f_{2} P^{2}_{d}}{{\it Prev}},$$
$$P(Dd|\hbox{Cases}) = \frac{2f_{1} (1 - P_{d})P_{d}}{{\it Prev}},$$
$$P(DD|\hbox{Cases}) = \frac{f_{0} (1 - P_{d})^{2}}{{\it Prev}};$$
$$P(dd|\hbox{Controls}) = \frac{(1 - f_{2})P^{2}_{d}}{1 - {\it Prev}},$$
$$P(Dd|\hbox{Controls}) = \frac{2(1 - f_{1})(1 - P_{d})P_{d}}{1 - {\it Prev}},$$
$$P(DD|\hbox{Controls}) = \frac{(1 - f_{0})(1 - P_{d})^{2}}{1 - {\it Prev}}.$$

The risk allele frequencies at a marker locus in cases and controls can therefore be obtained, respectively, as

$$\frac{f_{2} P^{2}_{d}}{{\it Prev}} + \frac{f_{1} (1 - P_{d})P_{d}}{{\it Prev}}\enspace\hbox{and}\enspace\frac{(1 - f_{2})P^{2}_{d}}{1 - {\it Prev}} + \frac{(1 - f_{1})(1 - P_{d})P_{d}}{1 - {\it Prev}}.$$
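The disease-model calculations above can be collected into a single routine that maps the model parameters to the case and control allele frequencies. A sketch (the function name is ours; Hardy–Weinberg proportions in the population are assumed, as in the text):

```python
def case_control_freqs(r1, r2, prev, p_d):
    """Risk allele frequencies (P^A, P^U) in cases and controls, given
    genotype relative risks R1 and R2, disease prevalence Prev, and
    population risk allele frequency P_d (HWE assumed)."""
    # Baseline penetrance f0 so that the model reproduces the prevalence.
    f0 = prev / ((1 - p_d) ** 2 + 2 * p_d * (1 - p_d) * r1 + p_d ** 2 * r2)
    f1, f2 = f0 * r1, f0 * r2  # penetrances of Dd and dd
    # P(d | case) = P(dd | case) + (1/2) P(Dd | case), and likewise for controls.
    p_case = (f2 * p_d ** 2 + f1 * (1 - p_d) * p_d) / prev
    p_ctrl = ((1 - f2) * p_d ** 2 + (1 - f1) * (1 - p_d) * p_d) / (1 - prev)
    return p_case, p_ctrl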

Results

In order to apply the entropy-based statistic to genome-wide association studies, we first examined its properties in the simplest setting, a single-locus case-control association study. In the Methods, we showed that when the sample size is large enough for large-sample theory to apply, the distribution of the entropy-based statistic under the null hypothesis of no association is asymptotically normal. To examine whether this asymptotic result still holds for a small sample size, 200 individuals were randomly generated in each of 10,000 simulations, and the entropy-based test statistic Z e was calculated in each simulation.

Table 1 summarizes the estimated type I error rates of the test statistic Z e for sample sizes from 100 to 500 individuals in the association test. The estimated type I error rates of Z e are not appreciably different from the nominal levels α = 0.05, α = 0.01, and α = 0.005. Table 2 summarizes the power of the entropy-based statistic in single-locus association studies for sample sizes from 100 to 500 individuals, using a multiplicative model with R 1 = 1.60 and R 2 = 2.56 and a disease prevalence of 0.10. The power of the entropy-based test is higher than that of the test using a linear function of risk allele frequencies.
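The type I error simulation can be sketched as follows; this is our own illustrative reconstruction (function name and simulation details are assumptions), drawing case and control alleles from the same frequency under the null and counting rejections of the entropy-based statistic:

```python
import math
import random
from statistics import NormalDist

def simulate_type1(p, n, alpha, reps=2000, seed=1):
    """Monte Carlo check of the entropy statistic's type I error rate.

    Under H0, 2n case alleles and 2n control alleles are drawn with the
    same risk allele frequency p; we count rejections at |Z^e| > z_crit.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(reps):
        pa = sum(rng.random() < p for _ in range(2 * n)) / (2 * n)
        pu = sum(rng.random() < p for _ in range(2 * n)) / (2 * n)
        if pa in (0.0, 1.0) or pu in (0.0, 1.0):
            continue  # entropy undefined at the boundary; rare when n*p >> 1
        sa, su = -pa * math.log(pa), -pu * math.log(pu)
        var = (pa * (1 - pa) * (1 + math.log(pa)) ** 2
               + pu * (1 - pu) * (1 + math.log(pu)) ** 2) / (2 * n)
        if abs(sa - su) / math.sqrt(var) > z_crit:
            hits += 1
    return hits / reps
```

An empirical rejection rate close to the nominal alpha indicates that the normal approximation for Z e is adequate at the given sample size.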

Table 1 Estimated type I error rates for the test statistic Z e in single-locus association tests (10,000 simulations)
Table 2 Power of the test statistic Z e in single-locus association study (10,000 simulations)

Now we apply the new statistic to joint analysis for two-stage genome-wide association studies. First, we compare the power of one-stage analysis, linear joint analysis, and entropy-based nonlinear joint analysis at αgenome = 0.05 over a wide range of proportions \(\pi_{{\rm samples}}\) of samples in stage 1 and proportions \(\pi_{{\rm markers}}\) of markers selected for follow-up genotyping in stage 2, under four different genetic models with risk allele frequencies of 0.05 and 0.10. All the results show that entropy-based nonlinear joint analysis is more powerful and a more efficient design for genome-wide association studies (Figs. 1, 2).

Fig. 1

Power of linear and entropy-based joint analyses with 2,000 cases and 2,000 controls genotyped on 300,000 independent markers with αgenome = 0.05. Using a multiplicative genetic model with R 1 = 1.40 and R 2 = 1.96 and disease prevalence of 0.10

Fig. 2

Power of linear and entropy-based joint analyses with 2,000 cases and 2,000 controls genotyped on 300,000 independent markers with αgenome = 0.05 under four different genetic models and disease prevalence of 0.10. Dominant model: R 1 = 1.60, R 2 = 1.60; recessive model: R 1 = 1, R 2 = 6; multiplicative model: R 1 = 1.60, R 2 = 2.56; additive model: R 1 = 1.50, R 2 = 2

We then investigate the power of nonlinear joint analysis as a function of both the risk allele frequency under multiplicative genetic models (Fig. 3) and the proportion \(\pi_{{\rm markers}}\) of markers selected for follow-up in stage 2 (Fig. 4). We find that the power of the entropy-based joint analysis is consistently higher than that of the linear joint analysis when the risk allele frequency is small. However, as the risk allele frequency increases, the powers of the two joint analyses become comparable.

Fig. 3

Power of linear and entropy-based joint analyses as a function of the frequencies of risk allele with 2,000 cases and 2,000 controls genotyped on 300,000 markers with αgenome = 0.05; it uses two multiplicative genetic models (R 1 =1.50, R 2 = 2.25  and  R 1 = 1.70, R 2 = 2.89) and disease prevalence of 0.10

Fig. 4

Power of linear and entropy-based joint analyses as a function of \(\pi_{{\rm markers}}\) with 2,000 cases and 2,000 controls genotyped on 300,000 markers with αgenome = 0.05; it uses dominant (R 1 = 1.60, R 2 = 1.60) and multiplicative (R 1 = 1.40, R 2 = 1.96) genetic models with disease prevalence of 0.10 and risk allele frequency of 0.10

We also investigate the sample sizes needed to detect genetic variants with different effect sizes (Fig. 5) using linear and entropy-based joint analyses in a two-stage design. The sample size needed for entropy-based joint analysis is smaller than that needed for linear joint analysis to attain the same power. To obtain a power of 80% for a genetic variant with P d = 0.10 under a multiplicative model, the required sample sizes are 2,540 and 1,227 for effect sizes of GRR = 1.4 and GRR = 1.6, respectively.

Fig. 5

Power of linear and entropy-based joint analyses with variant sample sizes on 300,000 independent markers with αgenome =  0.05. It uses two multiplicative models with R 1 = 1.40, R 2 = 1.96 and  R 1 = 1.60, R 2 = 2.56 respectively and prevalence = 0.10, risk allele frequency of 0.10

When controlling the false discovery rate, we compare the power of linear and entropy-based joint analyses in two-stage genome-wide association studies as a function of the difference in risk allele frequencies between cases and controls (Fig. 6). The power of the entropy-based joint analysis is higher than that of the linear joint analysis at the same false discovery rate when detecting genetic variants with a small frequency. Equivalently, if the two joint analyses are required to attain the same power, the false-positive rate of the linear joint analysis must increase. For example, the false-positive rate increases from 0.05 to nearly 0.10 when the two joint analyses achieve the same power for \(\pi_{{\rm samples}} = 0.30, \pi_{{\rm markers}} = 0.01,\) GRR = 1.60, and P A − P U = 0.04.

Fig. 6

Power of linear and entropy-based joint analyses under controlling the same false discovery rate, with sample size 2,000 cases and 2,000 controls on 300,000 independent markers. It uses dominant (R 1 = 1.60, R 2 = 1.60) and multiplicative (R 1 = 1.60, R 2 = 2.56) models, prevalence = 0.10, πsamples = 0.30 and \(\pi_{{\rm markers}}=0.01\)

In Table 3, we compare the power of the entropy-based joint analysis with that of the linear joint analysis under four different genetic models. The simulations show that the power of the entropy-based joint analysis is about 2% higher than that of the linear joint analysis at a risk allele frequency of 0.05. In Table 4, we evaluate the sample size needed for the entropy-based joint analysis; fewer samples are needed than for the linear joint analysis. In Table 5, we compare the power of the linear and entropy-based joint analyses when controlling the false discovery rate for a fixed allele frequency difference between cases and controls. The false discovery rate of the linear joint analysis increases from 0.05 to 0.1 when it is required to match the power of 0.93 attained by the entropy-based joint analysis. All results show that the entropy-based analysis is more powerful, needing fewer samples to attain the same power while achieving the same false discovery rate. This makes sense, as the entropy-based joint analysis uses a nonlinear function of the risk allele frequencies and thus makes fuller use of the information in all samples.

Table 3 Power of entropy-based joint analyses for two-stage genome-wide association studies under four genetic models
Table 4 Sample size to attain the desired significance level of two-stage genome-wide design 0.05 and power of 80% for various rare allele frequency differences and population allele frequencies
Table 5 Power of entropy-based joint analyses for two-stage genome-wide association studies when controlling FDR

Discussion

We have shown that entropy-based joint analysis for the two-stage genome-wide association design is a more efficient and more powerful strategy for identifying genetic variants of varying effect sizes associated with a disease when testing a large number of markers using unrelated case-control samples. To achieve an overall power of 90% when detecting genetic variants with small frequency and small to large effects, the sample size needed for the entropy-based joint analysis is about 30 fewer than that needed for the linear joint analysis.

Genome-wide disease-association mapping has been heralded as the study design of the next generation (Marchini et al. 2005). Two-stage designs are a promising strategy for genome-wide association studies, but the lack of analytical methods that use the genotype data fully remains a major stumbling block (Lin et al. 2004). We should therefore commit ourselves to finding more powerful and more efficient methods and statistics. The traditional test statistic of Skol et al. (2006) is a linear function of the difference (P A − P U) in risk allele frequencies between cases and controls. Here, we introduce a nonlinear function of the risk allele frequencies in cases and controls, the entropy difference (S A − S U), to develop a novel test statistic with high power for detecting the genetic variants underlying disease.

We investigate the distribution of the nonlinear entropy-based statistic under the null hypothesis by simulation. To validate the test statistic, we calculate its type I error rates by simulation; they are close to the nominal significance levels. To evaluate the performance of the entropy-based analysis, we compare the power of the entropy-based statistic in single-locus association studies with that of the statistic using a linear function of risk allele frequencies in cases and controls. The results show that the entropy-based statistic has higher power. However, since the power of a statistic is a complex issue, no single statistic is uniformly most powerful (Zhao et al. 2006), and the entropy-based analysis is not more powerful in all situations. When the difference in rare risk allele frequencies between cases and controls is large, that is, |P A − P U| > 0.07, the linear joint analysis is more powerful than the entropy-based joint analysis for detecting rare genetic variants with varying genetic effects. However, such large differences in rare risk allele frequencies between cases and controls are practically unrealistic in real-world studies of rare variants and common diseases.

Subsequently, we apply the entropy-based statistic to two-stage genome-wide association studies. We compare the power of the entropy-based nonlinear joint analysis with that of the linear joint analysis by simulation. The results show that the power of the entropy-based joint analysis is higher in most cases when detecting rare genetic variants with varying genetic effects. Entropy is, however, only one nonlinear transformation of the risk allele frequencies; general nonlinear transformations f(P A, P U) of the risk allele frequencies in cases and controls should be investigated in the future.

Here we have described entropy-based joint analysis for two-stage genome-wide association studies using independent genetic markers, but this assumption is violated when some markers are in linkage disequilibrium. Moreover, for genetic variants that each have a small marginal effect but contribute modest or large effects in combination, interactions between loci should be considered in genome-wide association studies. This will be an inevitable and promising field for genome-wide association studies.

The simulations show that, for a given sample size, one should genotype half of the individuals on all markers in the first stage and select 5% of the markers for follow-up genotyping in the second stage when using the entropy-based statistic; this provides a practical, cost-effective strategy for detecting rare genetic variants in association studies. The simulations also show that, when searching for rare genetic variants with moderate effects (R 1 = 1.4, 1.6), a sample size of approximately 2,000 suffices for a fixed rare allele frequency difference of 4–5% using the entropy-based joint analysis.

In multiple testing, there is an increasing trend toward using the false discovery rate as a measure of global error instead of the overall type I error rate. This article compares the power of the entropy-based joint analysis with that of the linear joint analysis when both control the false discovery rate at the same level, as is usually done in the literature (Benjamini and Hochberg 1995; Zou and Zuo 2006; Zuo et al. 2006). The results show that the entropy-based joint analysis achieves higher power than the linear joint analysis under the same false discovery rate.

In conclusion, numerous genome-wide association studies for a range of diseases are being planned or are already underway. New statistical methods that can deal with such large-scale studies are urgently needed to explore the etiology of complex diseases. Two-stage designs are more efficient than, and nearly as powerful as, the one-stage design. The results in this paper show that entropy-based joint analyses are more powerful and need fewer samples to attain the same power while achieving the same false discovery rate. We therefore suggest using entropy-based joint analysis for two-stage genome-wide association studies.