Introduction

Genetic association studies rely by design on the presence of linkage disequilibrium (LD) between the as yet unknown susceptibility locus and the neutral markers that have been genotyped. LD describes the nonindependence of alleles segregating at two or more loci. The sample size necessary to detect the susceptibility locus by studying a nearby neutral marker is commonly taken to be inflated by a factor 1/r², where r² is the LD between the two. For instance, if r² between marker and susceptibility locus is 0.5, then the sample size needed to achieve the same power as genotyping the susceptibility locus itself would be twice as large.1
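The 1/r² rule can be written out as a one-line calculation. The sketch below is only an illustration of the arithmetic; the function name `inflated_sample_size` and the example sample size of 1000 are hypothetical and do not come from the study.

```python
def inflated_sample_size(n_direct: float, r2: float) -> float:
    """Sample size needed at a tagging marker, given the sample size
    n_direct needed at the susceptibility locus itself and the LD r^2
    between marker and locus (the 1/r^2 rule)."""
    if not 0.0 < r2 <= 1.0:
        raise ValueError("r2 must lie in (0, 1]")
    return n_direct / r2


# Hypothetical example: 1000 samples would suffice at the causal locus.
n_marker = inflated_sample_size(1000, 0.5)
```

With r² = 0.5 the required sample size doubles, as in the example in the text.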

This concept of genetic association has recently been named ‘the Fundamental Theorem of the HapMap’ and has been criticized on theoretical grounds.2 The main issue raised is that r² is not necessarily multiplicative across multiple loci. It is commonly assumed that, given three consecutive loci A, B and C, the LD between A and C is the product of the LD between A and B and the LD between B and C: rAC = rAB × rBC. Terwilliger and Hiekkalinna (T&H) correctly pointed out that this relationship need not hold.
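That the product rule is not an algebraic identity can be seen from a deliberately artificial example. The sketch below uses made-up haplotype frequencies for three biallelic loci (not HapMap data) in which A and C always carry the same allele while B is independent of both; the names `HAPS` and `r` are hypothetical.

```python
import math

# Hypothetical haplotype frequencies over loci (A, B, C); alleles coded 0/1.
# A and C are identical on every haplotype, B is independent of both.
HAPS = {
    (0, 0, 0): 0.25,
    (0, 1, 0): 0.25,
    (1, 0, 1): 0.25,
    (1, 1, 1): 0.25,
}


def r(freqs, i, j):
    """Correlation coefficient r between the alleles at loci i and j."""
    p_i = sum(f for h, f in freqs.items() if h[i] == 1)
    p_j = sum(f for h, f in freqs.items() if h[j] == 1)
    p_ij = sum(f for h, f in freqs.items() if h[i] == 1 and h[j] == 1)
    return (p_ij - p_i * p_j) / math.sqrt(p_i * (1 - p_i) * p_j * (1 - p_j))


r_ab = r(HAPS, 0, 1)
r_bc = r(HAPS, 1, 2)
r_ac = r(HAPS, 0, 2)
```

Here rAB and rBC are both 0, so their product is 0, whereas rAC is 1. Real haplotype data will rarely be this extreme; the example only shows that the three pairwise correlations are not tied together by the product rule.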

As has been suggested before, genetic heterogeneity is a plausible biological mechanism for the lack of multiplicativity.3 If the genotyped marker A is in LD with two different susceptibility loci (B and C), both of which cause disease D, then the multiplicativity of r² will not hold (rAD ≠ rAB × rBD and rAD ≠ rAC × rCD). Hence, an important issue with respect to the feasibility of LD-based gene mapping in the light of the critique by T&H is to determine how often we might expect that any given neutral marker that truly tags a susceptibility locus also tags another. T&H show that underestimation of the necessary sample size for an association study as a result of deviation from the multiplicativity of r² is possible. Here, we aim to assess from the HapMap data itself how probable this deviation is, assuming that ‘cryptic’ tagging of multiple susceptibility loci is the predominant biological cause of this effect and assuming an additive model for penetrance.

Theory

Testing for association is equivalent to testing whether the correlation between the tagging SNP (T) and disease status (Ca) is equal to 0. Under the standard assumption that T is independent of Ca conditional on a disease locus D with one ‘healthy’ and one susceptibility allele, the correlation between T and Ca is given by the familiar expression

$$\rho(T, \mathrm{Ca}) = \rho(T, D)\,\rho(D, \mathrm{Ca}), \tag{1}$$

where ρ(X, Y) denotes the correlation between X and Y.

Here, we explore a biologically plausible scenario that could cause a deviation from this ‘multiplicativity of the correlation’: the situation where multiple correlated disease loci affect the phenotypic outcome.

Assume that there are k≥1 disease loci, and let D denote a haplotype at these loci. Assume that all disease loci are biallelic, with one variant a disease allele and the other a healthy allele, and let #D denote the number of loci among the k loci in D that carry the disease variant. Assume that, for some numbers α and β,

$$P(\mathrm{Ca} \mid D) = \alpha + \beta\,\#D. \tag{2}$$

This corresponds to an ‘additive model’ in which every disease allele adds an amount β to the penetrance. According to this formula, the prevalence is

$$P(\mathrm{Ca}) = \alpha + \beta\,E(\#D),$$

for E(#D) the average number of disease alleles of an individual in the population. Provided the second term βE(#D) is small, the parameter α can be approximately interpreted as the prevalence.
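As a minimal numerical sketch of the additive model, the code below evaluates the penetrance α + β·#D on a small set of haplotypes and checks that the resulting prevalence equals α + β·E(#D). The haplotype frequencies and the values of α and β are invented for illustration, and the function names are hypothetical.

```python
import math


def penetrance(haplotype, alpha, beta):
    """Additive model: alpha + beta * (number of disease alleles, coded 1)."""
    return alpha + beta * sum(haplotype)


def prevalence(hap_freqs, alpha, beta):
    """Prevalence as the frequency-weighted average penetrance."""
    return sum(f * penetrance(h, alpha, beta) for h, f in hap_freqs.items())


def mean_disease_alleles(hap_freqs):
    """E(#D): average number of disease alleles per haplotype."""
    return sum(f * sum(h) for h, f in hap_freqs.items())


# Hypothetical frequencies for haplotypes at k = 2 disease loci.
hap_freqs = {(0, 0): 0.70, (0, 1): 0.10, (1, 0): 0.10, (1, 1): 0.10}
alpha, beta = 0.01, 0.005  # illustrative values only

prev = prevalence(hap_freqs, alpha, beta)
```

In this toy setting E(#D) = 0.4, so the second term β·E(#D) = 0.002 is small compared with α = 0.01, which is the situation in which α is roughly the prevalence.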

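Let Di be the event that an individual has the disease allele at the disease locus i (with the other loci unspecified, so that Di is a union of certain haplotypes D), and correspondingly let $p_i = P(D_i)$ be the marginal frequency of the disease allele at locus i. We prove in the appendix that, under the additive model (2),

$$\rho(T, \mathrm{Ca}) = \sum_{i=1}^{k} \frac{\rho(T, D_i)\,\rho(D_i, \mathrm{Ca})}{\sum_{j=1}^{k} \rho(D_i, D_j)\sqrt{\dfrac{p_j(1-p_j)}{p_i(1-p_i)}}}.$$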
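A short sketch of how a relation of this kind can arise from the additive model is given below. It assumes, as in the one-locus case, that T is independent of Ca conditional on the haplotype D. The indicator $1_{D_i}$ of the event $D_i$ and the shorthand $\sigma_i = \sqrt{p_i(1-p_i)}$, $\sigma_T$ and $\sigma_{\mathrm{Ca}}$ for the standard deviations are notation introduced for this sketch only and are not taken from the appendix.

```latex
% Covariances under P(Ca | D) = alpha + beta #D, with #D = sum_i 1_{D_i}
\begin{align*}
\operatorname{cov}(T, \mathrm{Ca})
  &= \operatorname{cov}\bigl(T, \alpha + \beta\,\#D\bigr)
   = \beta \sum_{i=1}^{k} \operatorname{cov}(T, 1_{D_i})
   = \beta\,\sigma_T \sum_{i=1}^{k} \rho(T, D_i)\,\sigma_i, \\
\operatorname{cov}(1_{D_i}, \mathrm{Ca})
  &= \beta \sum_{j=1}^{k} \operatorname{cov}(1_{D_i}, 1_{D_j})
   = \beta\,\sigma_i \sum_{j=1}^{k} \rho(D_i, D_j)\,\sigma_j .
\end{align*}
% Dividing by the standard deviations gives the correlations
\begin{align*}
\rho(T, \mathrm{Ca}) &= \frac{\beta}{\sigma_{\mathrm{Ca}}} \sum_{i=1}^{k} \rho(T, D_i)\,\sigma_i, &
\rho(D_i, \mathrm{Ca}) &= \frac{\beta}{\sigma_{\mathrm{Ca}}} \sum_{j=1}^{k} \rho(D_i, D_j)\,\sigma_j .
\end{align*}
```

Solving the second identity for $\beta\sigma_i/\sigma_{\mathrm{Ca}}$ and substituting into the first expresses $\rho(T, \mathrm{Ca})$ as a sum over loci of the products $\rho(T, D_i)\,\rho(D_i, \mathrm{Ca})$, each divided by $\sum_j \rho(D_i, D_j)\,\sigma_j/\sigma_i$. The ratio $\sigma_j/\sigma_i$ is a square root of allele-frequency terms and equals 1 when the allele frequencies are equal.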

This formula is exact if the additive model (2) is exact. If the allele frequencies are equal, then the square root in the formula is 1 and disappears, and, moreover, (under (2)) the correlations are equal. The formula then simplifies to

$$\rho(T, \mathrm{Ca}) = \sum_{i=1}^{k} \frac{\rho(T, D_i)\,\rho(D_i, \mathrm{Ca})}{\sum_{j=1}^{k} \rho(D_i, D_j)}.$$

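The correlations ρ(Di, Dj) here refer to the association of the disease loci in the population (and not linkage). For two disease loci (k=2) with equal allele frequencies, the formula becomes

$$\rho(T, \mathrm{Ca}) = \rho(D_1, \mathrm{Ca})\,\frac{\rho(T, D_1) + \rho(T, D_2)}{1 + \rho(D_1, D_2)}.$$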

In comparison with formula (1) for the one-locus model (with D = D1), this exhibits the additional multiplication factor

$$\Delta = \frac{1 + \rho(T, D_2)/\rho(T, D_1)}{1 + \rho(D_1, D_2)}.$$

This factor quantifies the bias introduced in the correlation between T and Ca, and hence in the necessary sample size to detect association, due to cryptic tagging. When the two disease loci are not correlated and T does not tag D2, this factor reduces to 1. However, when the two disease loci are indeed correlated and both are tagged by T, Δ can still reduce to 1 if the multiplicativity of the correlation coefficients, which is not assumed, does hold: if ρ(T, D2) = ρ(T, D1) × ρ(D1, D2), then Δ=1. In all other cases, cryptic tagging will introduce a bias in the necessary sample size to detect association equal to the inverse of Δ².
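The behaviour of Δ described above can be checked numerically. The sketch below assumes the equal-allele-frequency, two-locus form Δ = (1 + ρ(T, D2)/ρ(T, D1)) / (1 + ρ(D1, D2)), which is consistent with the two special cases in the text, and compares it with a direct calculation of ρ(T, Ca) under the additive model. The joint haplotype frequencies `FREQS`, the values of `ALPHA` and `BETA`, and all function names are hypothetical.

```python
import math


def delta(r_t_d1, r_t_d2, r_d1_d2):
    """Bias factor for two disease loci with equal allele frequencies (assumed form)."""
    return (1 + r_t_d2 / r_t_d1) / (1 + r_d1_d2)


# Hypothetical joint frequencies of (T, D1, D2), alleles coded 0/1,
# chosen so that D1 and D2 have the same allele frequency (0.40).
FREQS = {
    (0, 0, 0): 0.35, (0, 0, 1): 0.10, (0, 1, 0): 0.05, (0, 1, 1): 0.05,
    (1, 0, 0): 0.10, (1, 0, 1): 0.05, (1, 1, 0): 0.10, (1, 1, 1): 0.20,
}
ALPHA, BETA = 0.01, 0.005  # illustrative penetrance parameters


def expect(g):
    """Expectation of g(t, d1, d2) under FREQS."""
    return sum(f * g(t, d1, d2) for (t, d1, d2), f in FREQS.items())


def pen(t, d1, d2):
    """Additive penetrance; T has no effect given the disease loci."""
    return ALPHA + BETA * (d1 + d2)


def corr(e_xy, p_x, p_y):
    """Correlation of two 0/1 variables from E(XY) and their means."""
    return (e_xy - p_x * p_y) / math.sqrt(p_x * (1 - p_x) * p_y * (1 - p_y))


p_t = expect(lambda t, d1, d2: t)
p_1 = expect(lambda t, d1, d2: d1)
p_2 = expect(lambda t, d1, d2: d2)
p_ca = expect(pen)  # prevalence

r_t1 = corr(expect(lambda t, d1, d2: t * d1), p_t, p_1)
r_t2 = corr(expect(lambda t, d1, d2: t * d2), p_t, p_2)
r_12 = corr(expect(lambda t, d1, d2: d1 * d2), p_1, p_2)
# E(T * Ca) = E(T * penetrance) because T and Ca are independent given D.
r_t_ca = corr(expect(lambda t, d1, d2: t * pen(t, d1, d2)), p_t, p_ca)
r_1_ca = corr(expect(lambda t, d1, d2: d1 * pen(t, d1, d2)), p_1, p_ca)

one_locus_prediction = r_t1 * r_1_ca
corrected_prediction = one_locus_prediction * delta(r_t1, r_t2, r_12)
sample_size_factor = 1 / delta(r_t1, r_t2, r_12) ** 2
```

In this toy example `corrected_prediction` agrees with the directly computed `r_t_ca`, whereas `one_locus_prediction` does not, and `sample_size_factor` is the corresponding multiplier on the sample size.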

HapMap data

Δ is a function of the three pairwise correlation coefficients between tagging SNP and both disease loci. The frequency distribution of these correlation coefficients can be estimated from the phased genotype data available from the HapMap project (http://www.hapmap.org/downloads/index.html.en). For this analysis, we used data from the CEU population.

As the distribution of LD may differ between chromosomes and between SNPs of different minor allele frequency (MAF), we considered each chromosome individually and used five different bins of MAF in the analysis. SNPs with MAF of 5–10, 10–20, 20–30, 30–40 and 40–50% were considered separately. All SNPs on a chromosome that fall in a given MAF bin were ascertained and two were chosen randomly. These represent the susceptibility loci D1 and D2. Next, the extent of LD was determined between the D1 locus and all other SNPs on the chromosome. One SNP was randomly picked conditional on it being in LD with the D1 locus with an r2 larger than 0.8. This is the tagging SNP T, which tags D1 and may or may not, depending on the haplotype structure of the genome, tag D2. With the three SNPs D1, D2 and T chosen in this manner, the three pairwise correlation coefficients were calculated and Δ determined. This procedure was repeated 1000 times for each chromosome and MAF–bin combination to generate a genome-wide distribution of Δ.
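The sampling scheme can be sketched in a few lines. The code below is not the analysis pipeline used for the HapMap data: it runs on a small synthetic set of 0/1 allele columns, treats the allele coded 1 as the disease allele at D1 and D2, excludes D2 from the candidate tagging SNPs, and uses the equal-allele-frequency two-locus form of Δ. All of these choices, and all function names, are assumptions made for illustration.

```python
import math
import random


def corr(x, y):
    """Correlation r between two 0/1 allele columns (0.0 if either is monomorphic)."""
    n = len(x)
    px, py = sum(x) / n, sum(y) / n
    denom = math.sqrt(px * (1 - px) * py * (1 - py))
    if denom == 0:
        return 0.0
    pxy = sum(a * b for a, b in zip(x, y)) / n
    return (pxy - px * py) / denom


def maf(col):
    """Minor allele frequency of a 0/1 allele column."""
    p = sum(col) / len(col)
    return min(p, 1 - p)


def draw_delta(snps, lo, hi, rng, r2_min=0.8):
    """One replicate: pick D1, D2 from a MAF bin and T with r^2 > r2_min to D1.

    Returns Delta, or None when no eligible triple exists."""
    candidates = [i for i, c in enumerate(snps) if lo <= maf(c) < hi]
    if len(candidates) < 2:
        return None
    d1, d2 = rng.sample(candidates, 2)
    tags = [i for i, c in enumerate(snps)
            if i not in (d1, d2) and corr(c, snps[d1]) ** 2 > r2_min]
    if not tags:
        return None
    t = rng.choice(tags)
    r_t1 = corr(snps[t], snps[d1])
    r_t2 = corr(snps[t], snps[d2])
    r_12 = corr(snps[d1], snps[d2])
    if 1 + r_12 == 0:  # perfectly anti-correlated disease loci
        return None
    return (1 + r_t2 / r_t1) / (1 + r_12)


def synthetic_snps(n_hap, n_snp, rng):
    """Toy data: independent SNPs, each followed by a noisy copy of itself."""
    snps = []
    for j in range(n_snp):
        if j % 2 == 0:
            p = rng.uniform(0.2, 0.5)
            col = [1 if rng.random() < p else 0 for _ in range(n_hap)]
        else:
            col = [a if rng.random() < 0.98 else 1 - a for a in snps[-1]]
        snps.append(col)
    return snps


rng = random.Random(0)
snps = synthetic_snps(200, 40, rng)
draws = [draw_delta(snps, 0.2, 0.5, rng) for _ in range(50)]
deltas = [d for d in draws if d is not None]
```

Repeating `draw_delta` for each chromosome and MAF bin and collecting the squared values would give an empirical distribution of Δ² of the kind analysed here; on real data the columns would come from the phased haplotypes rather than from `synthetic_snps`.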

We did not find any systematic differences in the location of the distribution of Δ² between chromosomes (Kruskal–Wallis test), but the rarer SNPs showed a visibly stronger concentration around 1 (Figure 1; shape difference confirmed by Kolmogorov–Smirnov test, P<0.001). The observed values of Δ² ranged between 0.00018 and 1.63. This large range is in line with the theoretical prediction of T&H that the upper limit for the necessary sample size to detect association when multiplicativity of r² is not assumed includes infinity. However, the percentiles of the distribution of Δ² show that the extreme values are rare. In 95% of our data, Δ² lies between 0.92 and 1.09, and except for one case Δ² did not fall below ≈0.7. (The 0.1, 0.5, 1, 2, 2.5, 3, 4, 5 and 97.5 percentiles were equal to 0.79, 0.86, 0.88, 0.91, 0.92, 0.93, 0.94, 0.95 and 1.09, respectively.) We also explored other scenarios (with the two disease loci chosen either randomly with respect to minor allele frequency or ascertained such that one would have MAF≈0.1 and the other MAF≈0.4), but the distribution of Δ² was similar to the one described above (data not shown). In conclusion, fewer than 5% of all association studies would need 5–30% larger sample sizes to achieve the same power.
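The step from percentiles of Δ² to sample-size inflation is simple arithmetic, since the required sample size scales with 1/Δ². The sketch below applies this to the percentiles reported above; the variable names are hypothetical.

```python
# Percentile of the distribution -> reported value of Delta^2.
DELTA_SQ_PERCENTILES = {
    0.1: 0.79, 0.5: 0.86, 1: 0.88, 2: 0.91, 2.5: 0.92,
    3: 0.93, 4: 0.94, 5: 0.95, 97.5: 1.09,
}

# Extra sample size (in percent) needed to keep the same power: 100 * (1/Delta^2 - 1).
extra_percent = {
    pct: 100.0 * (1.0 / d2 - 1.0) for pct, d2 in DELTA_SQ_PERCENTILES.items()
}
```

At the 5th percentile (Δ² = 0.95) the sample size must grow by a little over 5%, and at the 0.1 percentile (Δ² = 0.79) by roughly 27%. At the 97.5 percentile (Δ² = 1.09) the value is negative, meaning that cryptic tagging would reduce the required sample size.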

Figure 1. Distribution of the bias parameter Δ². The value of Δ² is given on the vertical scale. Shown are scatterplots for the five MAF classes described in the text for each of the 22 autosomes.

Incidentally, the lowest value of Δ² involved a pair of common alleles on chromosome 17. The correlation coefficients were , and , representing a situation where the susceptibility allele at one locus is in fairly strong LD with the healthy allele at the other locus while the tagging SNP tags both loci equally. In such a rare, worst-case scenario, detecting association through tagging is virtually impossible, even under our additive model (2).

Discussion

In this paper, we empirically assess the deviation from the HapMap theorem induced by cryptic tagging of multiple susceptibility loci by a neutral SNP. This scenario seems the most likely biological mechanism that might result in the nonmultiplicativity of r². Conditional on the haplotype structure of the genome, a tagging SNP in LD with one susceptibility locus might also exhibit high levels of LD with another susceptibility locus. This paper characterizes the distribution of the correlations between a tagging SNP and a randomly placed second ‘susceptibility’ locus on the basis of the CEU HapMap data, under an additive model for penetrance. In agreement with T&H, we find that nonmultiplicativity of r² can indeed decrease the power of an association study, but we show that the bias introduced by cryptic tagging is relatively modest. We did observe one instance where cryptic tagging would have completely abolished the power of an association test through a tagging SNP, as predicted by T&H. However, this scenario is extremely rare as long as susceptibility loci do not tend to be colocalized in the genome. If they did, underestimation of the necessary sample size owing to cryptic tagging would be much more severe. As mechanisms other than the one analyzed here might also account for the nonmultiplicativity of r², some caution in designing genome-wide association studies seems to be in order. On the basis of these results, a safe bet is to use a sample size that is 10% larger than otherwise deemed necessary.