Introduction

Large-scale genetic association studies generally depend at least in part on the existence of linkage disequilibrium (LD) between genetic markers and a disease locus (for a review see Clark1). LD is a concept of statistical correlation between alleles segregating at two or more loci.2 Conversely, linkage equilibrium refers to the state where the alleles at a particular locus are independently distributed with respect to the alleles at another locus. There are several ways in which LD can be generated in a sample of chromosomes. Population genetic factors can produce LD through a variety of processes such as natural selection, strong genetic drift, admixture and new mutations.3, 4, 5 Similarly, skewed sampling of chromosomes from a population, for example, the selection of disease-affected individuals, can also give rise to LD levels higher or lower than expected. Importantly, this type of skewed sampling can also produce departures from what is termed the ‘fundamental theorem of the HapMap’ (terminology from Terwilliger and Hiekkalinna, 2006).6 The fundamental theorem states that the statistical power to detect disease association indirectly at a marker locus in LD with a disease-susceptibility locus is approximately the same as the power to detect disease association directly at the susceptibility locus, if the sample size is increased by a factor of 1/r2, where r2 is the commonly used measure of pairwise LD. Put concretely, imagine two loci, one directly involved in disease and the other in LD with the first, causative locus, with r2 = 1/3 between them. If 500 cases and 500 controls are required to obtain 80% power to detect disease association at the causative locus, then the fundamental theorem states that approximately 1500 cases and 1500 controls are necessary to reach the same 80% power level at the marker locus in LD with the causative locus.
As pairwise LD is correlated with disease status, some level of departure from the fundamental theorem is to be expected. This point was made explicitly by Terwilliger and Hiekkalinna6 where they argued that the fundamental theorem only applies, among other conditions, if the ‘LD between loci and the etiological effect of the functional variant are independent of each other.’
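The arithmetic behind the 500 versus 1500 example can be sketched as follows. This is only an illustration of the 1/r2 scaling rule quoted above; the function name `indirect_sample_size` is hypothetical and not part of any published software.

```python
import math


def indirect_sample_size(n_direct: int, r2: float) -> int:
    """Sample size at a marker locus implied by the 1/r2 rule of thumb.

    n_direct is the sample size giving the desired power at the causal
    locus itself; r2 is the pairwise LD between marker and causal locus.
    """
    if not 0.0 < r2 <= 1.0:
        raise ValueError("r2 must lie in (0, 1]")
    # A tiny tolerance keeps floating-point noise from rounding up.
    return math.ceil(n_direct / r2 - 1e-9)


print(indirect_sample_size(500, 1 / 3))
```

With r2 = 1/3 the required number of cases (and of controls) triples, which is the situation described in the text. The rule assumes LD is the same in cases and in the general population, which is exactly the assumption examined in this paper.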

Recently, there has been considerable interest in creating and utilizing whole-genome haplotype maps for the purposes of disease-susceptibility mapping in humans7, 8 and investigation of genetic structure of populations. These maps allow one to quantify the strength of LD across the entire genome. By choosing representative markers from sets of markers that are in high LD with each other, investigators aim to reduce drastically the number of markers necessary to interrogate adequately the genome.9, 10 Additionally, the pattern of disease association decay with decreasing pairwise LD can be used to identify regions that are more likely to carry predisposing chromosomal segments. As LD maps are typically constructed from randomly sampled individuals, understanding the effect that different disease models have on modifying the level of LD in patients is important: such information can be used to (i) better select tagging markers for large-scale studies and (ii) construct statistical tests to better understand if specific regions are disease-predisposing. To these ends, we wanted to investigate the impact that different disease models have on traditional measures of LD.

In this paper, we derive simple analytic results for commonly used measures of LD under general single gene disease models, defining regions of the parameter space that give rise to LD levels in disease-affected individuals either above or below the general population LD level. We then investigate LD sampling properties given general population haplotype frequencies under a neutral coalescent. The results characterize the effect of disease models on LD patterns. This work may change how current HapMap data are used to select tagging single-nucleotide polymorphisms (SNPs). For example, in some instances, it may be desirable to genotype densely a small set of affected individuals alongside a small set of control individuals, just as the HapMap project densely genotyped randomly selected individuals. Additionally, selection of SNPs to perform fine-scale mapping, once associated markers are identified, may be informed by explicitly modeling LD patterns differently between cases and controls. Lastly, this type of information may also enable improved statistical tests for identifying regions with disequilibria patterns that correspond to those expected under certain disease models.

Important developments in this area can be found in a study by Nielsen et al,11 where the authors construct a statistical test using LD differences between cases and controls, thereby providing researchers an additional method for testing for association, aside from the more traditional haplotype-based contingency table tests of homogeneity. More recently, an extension of this work was published showing analytic and graphical methods for this LD-contrast-type test.12

Theory

To characterize pairwise LD in preferentially selected groups of individuals, we will define a simple single gene disease model and explore two commonly used measures, D and r2, as a function of penetrance parameters and allele frequencies. Both asymptotic and sampling results are presented. For a two-locus model, say loci A and B, in which two alleles segregate at each locus, D is defined as p11p22 − p12p21, where pij is the frequency of the AiBj haplotype. Denote A1B1 and A2B2 as parental haplotypes, and the remaining two as recombinant haplotypes. See Devlin and Risch13 for a review of these and other measures of LD. r2, the squared correlation coefficient between alleles at the two loci,14 is a normalized version of D and is defined as

$$r^2 = \frac{D^2}{p_{1\bullet}(1-p_{1\bullet})\,p_{\bullet 1}(1-p_{\bullet 1})}$$

denoting the margins (single-locus allele frequencies) p11+p12 and p11+p21 by p1• and p•1, respectively. We will treat both LD measures as being calculated in two distinct ways: as population parameters and as sampling statistics from a small set of chromosomes from affected individuals, the former of which will be called ‘asymptotic results’ and the latter ‘sampling results’.
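As a minimal sketch of these definitions, the following function computes D and r2 from the four haplotype frequencies. The name `ld_measures` is illustrative. Returning r2 = 0 when a locus is monomorphic is a convention assumed here so that the function is always defined.

```python
def ld_measures(p11: float, p12: float, p21: float, p22: float):
    """Return (D, r2) for two biallelic loci from haplotype frequencies."""
    if abs(p11 + p12 + p21 + p22 - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    d = p11 * p22 - p12 * p21          # D = p11*p22 - p12*p21
    pa = p11 + p12                     # frequency of allele A1
    pb = p11 + p21                     # frequency of allele B1
    denom = pa * (1 - pa) * pb * (1 - pb)
    r2 = d * d / denom if denom > 0 else 0.0  # assumed convention at fixation
    return d, r2


d, r2 = ld_measures(0.10, 0.01, 0.01, 0.88)
print(round(d, 4), round(r2, 4))
```

The haplotype frequencies used in the call are those of the power example later in the paper, for which the text reports r2 = 0.8061 in the general population.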

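Suppose now that the A locus postulated above has a variant that predisposes carriers to a disease phenotype. Following this characterization, designate the B locus as the marker locus with no causal relationship to the disease phenotype. Denote the two alleles segregating at the A locus by A1 and A2. Further define three genotypic penetrances to specify a single gene disease model, f11=P[Dz | A1A1], f12=P[Dz | A1A2] and f22=P[Dz | A2A2]. Let us define the frequencies of the haplotypes in affected individuals as $\tilde{p}_{ij}$, using analogous definitions as in the general population. Assuming Hardy–Weinberg equilibrium (HWE), we arrive at the set of affected haplotype frequencies by applying Bayes’ rule:

$$\tilde{p}_{11} = \frac{p_{11}\,[f_{11}p_{1\bullet} + f_{12}(1-p_{1\bullet})]}{K} \tag{2}$$

$$\tilde{p}_{12} = \frac{p_{12}\,[f_{11}p_{1\bullet} + f_{12}(1-p_{1\bullet})]}{K} \tag{3}$$

$$\tilde{p}_{21} = \frac{p_{21}\,[f_{12}p_{1\bullet} + f_{22}(1-p_{1\bullet})]}{K} \tag{4}$$

$$\tilde{p}_{22} = \frac{p_{22}\,[f_{12}p_{1\bullet} + f_{22}(1-p_{1\bullet})]}{K} \tag{5}$$

where the prevalence of disease is K=P[Dz], and making use of the Hardy–Weinberg assumption, $K=f_{11}p_{1\bullet}^2+2f_{12}p_{1\bullet}(1-p_{1\bullet})+f_{22}(1-p_{1\bullet})^2$. Similar equations, using different notation, can be found in Nielsen et al.11 The single-locus allele frequencies in affected individuals are simply

$$\tilde{p}_{1\bullet} = \tilde{p}_{11} + \tilde{p}_{12} = \frac{p_{1\bullet}\,[f_{11}p_{1\bullet} + f_{12}(1-p_{1\bullet})]}{K}$$

$$\tilde{p}_{\bullet 1} = \tilde{p}_{11} + \tilde{p}_{21} \tag{7}$$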
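The Bayes' rule step can be sketched numerically as below. The function name and the tuple ordering (p11, p12, p21, p22) are assumptions made for illustration. The weights `w1` and `w2` are the probabilities of disease given that a chromosome carries A1 or A2, with the partner chromosome drawn at random under HWE.

```python
def affected_haplotype_freqs(p, f11, f12, f22):
    """Haplotype frequencies among affected individuals, plus prevalence K.

    p is (p11, p12, p21, p22) in the general population; f11, f12, f22 are
    the genotypic penetrances at the A locus.
    """
    p11, p12, p21, p22 = p
    pa = p11 + p12                         # frequency of A1
    w1 = f11 * pa + f12 * (1 - pa)         # P[Dz | chromosome carries A1]
    w2 = f12 * pa + f22 * (1 - pa)         # P[Dz | chromosome carries A2]
    k = pa * w1 + (1 - pa) * w2            # prevalence under HWE
    if k <= 0:
        raise ValueError("prevalence must be positive")
    return (p11 * w1 / k, p12 * w1 / k, p21 * w2 / k, p22 * w2 / k), k


freqs, k = affected_haplotype_freqs((0.10, 0.05, 0.05, 0.80), 0.10, 0.001, 0.001)
print([round(x, 4) for x in freqs], round(k, 5))
```

When all three penetrances are equal, the weights cancel against K and the affected frequencies reduce to the population frequencies, which is a convenient sanity check.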

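It is now a matter of simple algebra to calculate LD measures in patients under the general single gene model (a tilde denotes a quantity measured in affected individuals). Combining the above results allows $\tilde{D}_\infty$, the LD in all affected individuals, to be expressed in terms of the general population D and a multiplicative factor:

$$\tilde{D}_\infty = D\,\frac{[f_{11}p_{1\bullet} + f_{12}(1-p_{1\bullet})]\,[f_{12}p_{1\bullet} + f_{22}(1-p_{1\bullet})]}{K^2}$$

where the infinity subscript is shown to indicate that this is an asymptotic result applying to the population as a whole. If we set each of the penetrances equal to the same constant, it can be easily verified that $\tilde{D}_\infty = D$, as expected. Similarly, an expression for $\tilde{r}^2_\infty$ can be obtained. For the sake of brevity, we write $\tilde{r}^2_\infty$ in terms of the allele frequencies in the affected individuals in the denominator:

$$\tilde{r}^2_\infty = \frac{\tilde{D}_\infty^2}{\tilde{p}_{1\bullet}(1-\tilde{p}_{1\bullet})\,\tilde{p}_{\bullet 1}(1-\tilde{p}_{\bullet 1})}$$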

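The ratio of $\tilde{r}^2_\infty$ to $r^2$ is therefore

$$R_\infty = \frac{\tilde{r}^2_\infty}{r^2} = \left(\frac{\tilde{D}_\infty}{D}\right)^2 \frac{p_{1\bullet}(1-p_{1\bullet})\,p_{\bullet 1}(1-p_{\bullet 1})}{\tilde{p}_{1\bullet}(1-\tilde{p}_{1\bullet})\,\tilde{p}_{\bullet 1}(1-\tilde{p}_{\bullet 1})}$$

where $\tilde{D}_\infty/D$ is the multiplicative factor relating LD in affected individuals to LD in the general population, and tildes denote allele frequencies in affected individuals.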

These asymptotic LD measures can be examined under specific disease models by positing relationships between the three penetrances. Evaluation of four classic models, dominant, recessive, additive and multiplicative, will shed some light on how phenotype-based sampling modifies levels of LD. First, consider the pure dominant model where f11=f12 and f22=0 (this and other models are considered ‘pure’ models when one or more of the penetrances of genotypes not carrying a predisposing allele is 0, that is, the prevalence is zero in the absence of the predisposing allele). Under this dominant model,

$$\frac{\tilde{D}_\infty}{D} = \frac{1}{p_{1\bullet}(2-p_{1\bullet})^2}$$

Hence, when the frequency of the predisposing allele, p1•, is less than $(3-\sqrt{5})/2$ (approximately 0.381966), $\tilde{D}_\infty > D$; otherwise, the LD in patients is less than the general population value, ignoring trivial solutions. Generalizing this dominant model by considering f22≥0,

$$\frac{\tilde{D}_\infty}{D} = \frac{\gamma\,[1+(\gamma-1)p_{1\bullet}]}{[\gamma p_{1\bullet}(2-p_{1\bullet}) + (1-p_{1\bullet})^2]^2}$$

For this and subsequent results, the notation is changed to genotype relative risk (represented as γ), such that (f11/f22)=(f12/f22)=γ for the above model. $\tilde{D}_\infty$ in patients under the pure recessive mode of inheritance, where f12=f22=0, is zero regardless of allele frequency. The reason for this is that all patients must have the A1A1 genotype and therefore carry only two possible haplotypes, A1B1 and A1B2, necessarily yielding $\tilde{D}_\infty = 0$. The general recessive model, f22=f12, (f11/f22)=γ, has richer dynamics:

$$\frac{\tilde{D}_\infty}{D} = \frac{1+(\gamma-1)p_{1\bullet}}{[1+(\gamma-1)p_{1\bullet}^2]^2} \tag{13}$$

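Analysis of equation (13) shows that for high-frequency predisposing alleles, $\tilde{D}_\infty < D$; otherwise low-frequency predisposing alleles with a recessive mode of inheritance produce higher LD levels in patients (see Table 1). The two intermediate models, general additive, with (f11/f22)=2γ–1 and (f12/f22)=γ, and multiplicative, with (f11/f22)=$\gamma^2$ and (f12/f22)=γ, give

$$\frac{\tilde{D}_\infty}{D} = \frac{[\gamma+(\gamma-1)p_{1\bullet}]\,[1+(\gamma-1)p_{1\bullet}]}{[1+2(\gamma-1)p_{1\bullet}]^2}$$

and

$$\frac{\tilde{D}_\infty}{D} = \frac{\gamma}{[1+(\gamma-1)p_{1\bullet}]^2}$$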

respectively. Under the additive model, $\tilde{D}_\infty < D$ for allele frequencies, p1•, above $[(\gamma-3)+\sqrt{\gamma^2+6\gamma-3}]/[6(\gamma-1)]$. Similarly, for the multiplicative model, $\tilde{D}_\infty < D$ when $p_{1\bullet} > 1/(1+\sqrt{\gamma})$. Additionally, for modes of inheritance where the penetrance of the heterozygote is smaller than that of either homozygote (an ‘underdominant’ model), the inequality always holds. Table 1 shows a summary of results under various inheritance models with analytic results of LD isoclines.
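The model-specific results can be checked numerically with a short sketch. It assumes the ratio of LD in affected individuals to LD in the general population equals w1·w2/K², where w1 and w2 are the average penetrances of chromosomes carrying A1 and A2 under HWE; this form follows from the Bayes' rule step described earlier. The helper names `d_ratio` and `model_penetrances` are hypothetical.

```python
def d_ratio(p: float, f11: float, f12: float, f22: float) -> float:
    """Ratio of D in affected individuals to D in the general population."""
    q = 1 - p
    w1 = f11 * p + f12 * q      # average penetrance of an A1 chromosome
    w2 = f12 * p + f22 * q      # average penetrance of an A2 chromosome
    k = p * w1 + q * w2         # disease prevalence
    return w1 * w2 / (k * k)


def model_penetrances(model: str, gamma: float):
    """Penetrances (f11, f12, f22) relative to f22 = 1 for the named model."""
    table = {
        "dominant": (gamma, gamma, 1.0),
        "recessive": (gamma, 1.0, 1.0),
        "additive": (2 * gamma - 1, gamma, 1.0),
        "multiplicative": (gamma ** 2, gamma, 1.0),
    }
    return table[model]


# Pure dominant model (f22 = 0): the ratio crosses 1 near p = 0.381966.
for p in (0.30, (3 - 5 ** 0.5) / 2, 0.50):
    print(round(p, 6), round(d_ratio(p, 1.0, 1.0, 0.0), 4))
```

Only ratios of penetrances matter in this expression, so setting f22 = 1 in `model_penetrances` loses no generality for the general (non-pure) models.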

Table 1 LD relationships under various disease models

As the allele frequencies pi• and p•j are invariant to the effects of recombination, the ratio of $\tilde{D}$ to D does not vary with recombination fraction. However, this is not the case with r2: the ratio $R=\tilde{r}^2/r^2$ does change, often dramatically, with increasing recombination. This is because the allele frequency at the marker locus within affecteds cannot be expressed solely in terms of penetrances and population allele frequencies; it depends on the haplotype frequencies. Our immediate goal here is to evaluate the ratio $R_t=\tilde{r}^2_t/r^2_t$ for infinite populations following different numbers of generations. Subscripts are shown to indicate that the quantities are now a function of accumulated recombination following the passage of t generations.

$$R_t = b\,\frac{p_{\bullet 1}(1-p_{\bullet 1})}{\tilde{p}_{\bullet 1,t}(1-\tilde{p}_{\bullet 1,t})}$$

where the factor b is not a function of recombination rate, and can be shown to be

$$b = \frac{[f_{11}p_{1\bullet} + f_{12}(1-p_{1\bullet})]\,[f_{12}p_{1\bullet} + f_{22}(1-p_{1\bullet})]}{K^2}$$

To characterize the effect that recombination has on the marker locus, we first use the standard recursion-based derivation for the haplotype frequencies following t generations,

$$p_{ij,t} = p_{i\bullet}\,p_{\bullet j} + (-1)^{i+j}(1-\theta)^t D_0 \tag{18}$$

where θ is the recombination fraction between the two loci and $D_0$ is the initial value of D. Hence, substitution of the right-hand side of equation (18) into equation (7) yields

$$\tilde{p}_{\bullet 1,t} = p_{\bullet 1} + (1-\theta)^t D_0\,\frac{(f_{11}-f_{12})\,p_{1\bullet} + (f_{12}-f_{22})(1-p_{1\bullet})}{K}$$

which, in turn, is used to complete the derivation of Rt in terms of recombination fractions, generations, and the initial state of the system.
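The decay of r2 in the general population and in affected individuals can be traced with a small sketch. It assumes the standard deterministic recursion in which D shrinks by a factor (1 − θ) each generation, with θ (here `theta`) the recombination fraction, and it computes the affected haplotype frequencies by Bayes' rule under HWE. The function name is illustrative.

```python
def r2_ratio_over_time(p, f11, f12, f22, theta, generations):
    """Return rows (t, r2 in population, r2 in affecteds, ratio R_t).

    p is the initial (p11, p12, p21, p22); both loci must be polymorphic
    and the initial D must be non-zero.
    """
    p11, p12, p21, p22 = p
    pa, pb = p11 + p12, p11 + p21
    d0 = p11 * p22 - p12 * p21
    if d0 == 0:
        raise ValueError("initial D must be non-zero")
    w1 = f11 * pa + f12 * (1 - pa)       # P[Dz | A1 chromosome]
    w2 = f12 * pa + f22 * (1 - pa)       # P[Dz | A2 chromosome]
    k = pa * w1 + (1 - pa) * w2          # prevalence
    pa_aff = pa * w1 / k                 # A1 frequency in affecteds
    rows = []
    for t in range(generations + 1):
        d_t = (1 - theta) ** t * d0
        r2_pop = d_t ** 2 / (pa * (1 - pa) * pb * (1 - pb))
        d_aff = d_t * w1 * w2 / k ** 2
        pb_aff = pb + d_t * (w1 - w2) / k   # B1 frequency in affecteds
        r2_aff = d_aff ** 2 / (pa_aff * (1 - pa_aff) * pb_aff * (1 - pb_aff))
        rows.append((t, r2_pop, r2_aff, r2_aff / r2_pop))
    return rows


# Recessive model with relative risk 10, starting from parental haplotypes only.
rows = r2_ratio_over_time((0.5, 0.0, 0.0, 0.5), 10.0, 1.0, 1.0, 0.01, 300)
print(rows[0][3], rows[50][3], rows[300][3])
```

With only parental haplotypes present at t = 0, both r2 values equal 1 and the ratio starts at 1. Under this recessive parameterization the ratio then falls below 1 and levels off at the recombination-independent factor w1·w2/K².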

Figures 1a and b show the decay in LD between a causative locus A and marker locus B as recombination increases between the two loci. LD levels within affected individuals and within the general population were calculated under three different disease models. In all models, initially only the parental haplotypes are present, each at a frequency of 50% (p11=0.50; p22=0.50). The number of recombination events increases linearly in time. The first model investigated is of a disease with a dominant mode of inheritance against a background of sporadic disease. The results under this model are presented in Figure 1a, which shows the common situation where Rt>1. Rt is above unity regardless of the number of recombinant haplotypes, and the initial decay of LD is more rapid in the general population. Figure 1b shows the decay of LD under a recessive model. LD decay patterns under an underdominant model were studied next. The departure in LD in the affecteds from the general population under the underdominant model is more extreme than under the recessive model, with certain recombination rates yielding Rt values below ½ (results not shown). In both of these models, LD levels in the disease population are lower than those in the general population across recombination fractions. It should be pointed out that empirically the patterns of LD across organisms are ubiquitously complex, and not fully determined by the effects of recombination. Although the results presented in this paper are approximate (we only model the effect of recombination on LD), we believe that they may be of interest.

Figure 1

r2 values for the general population are shown in a solid line, whereas r2 values for affected individuals are shown in the dashed line. (a) Decay of LD from cumulative recombination events with and without a disease model. Under this dominant model, the LD among affected individuals is always equal to or higher than the LD level in the general population. Switching to a recessive model (b) displays the opposite pattern with Rt<1 for any nontrivial level of recombination. The relative risk of predisposing to nonpredisposing genotypes under both models is 10.

Potential impact on power

To illustrate how differences in LD patterns between cases and controls can impact disease gene mapping, consider the following example. Suppose the general population haplotype frequencies are p11=0.10, p12=0.01, p21=0.01 and p22=0.88. The minor allele frequency at either site is 11%, and the r2 statistic in the general population is 0.8061, above many commonly used thresholds for identifying SNP pairs in high LD. Further consider a recessive disease model with penetrances f22=0.05, f11=f12=0.0001. Disease prevalence, assuming HWE, is approximately 4% under these conditions. The haplotype frequencies in affected individuals are expected to be approximately $\tilde{p}_{11}=0.0003$, $\tilde{p}_{12}=0.00003$, $\tilde{p}_{21}=0.0112$ and $\tilde{p}_{22}=0.9885$, yielding dramatically lower LD in the affected population: $\tilde{r}^2 \approx 0.02$. Trouble can arise in this situation if an investigator assumes the general population LD level before a case/control experiment. Given the high LD in the general population, one may assume that the second site (locus B in the terminology used above) could be used as a tagging SNP for the first locus (the disease-predisposing locus). Assuming 250 case and 250 control chromosomes are used in a genome-wide association scan, power to detect disease association at the B locus is approximately 35% (taking a Bonferroni-corrected significance level of 1 × 10−7). However, if the investigator had refrained from assuming that LD patterns across cases and controls were similar, and perhaps had gone further and genotyped densely in a small set of affected individuals, then noting that $\tilde{r}^2$ is far below r2 may have persuaded this judicious researcher to genotype both the disease and marker loci (both the A and B SNPs). Had both loci been genotyped in the case/control experiment, power would more than double to 73% for a two-locus haplotype test (using the same sample size and significance level). In both of the above power calculations, a Monte Carlo simulation running 100 000 replicates was used.
Although this is an extreme example, it nonetheless demonstrates the possibility that ignoring the impact of disease models on LD can hinder mapping efforts.
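The LD figures in this example can be recomputed from the stated haplotype frequencies and penetrances. The sketch below is not the authors' code; it assumes HWE and Bayes' rule for the affected haplotype frequencies, and it does not attempt to reproduce the Monte Carlo power calculation.

```python
# General population haplotype frequencies and penetrances from the example.
p11, p12, p21, p22 = 0.10, 0.01, 0.01, 0.88
f11, f12, f22 = 0.0001, 0.0001, 0.05


def r2(h11, h12, h21, h22):
    """Squared allelic correlation from four haplotype frequencies."""
    d = h11 * h22 - h12 * h21
    a, b = h11 + h12, h11 + h21
    return d * d / (a * (1 - a) * b * (1 - b))


pa = p11 + p12
w1 = f11 * pa + f12 * (1 - pa)          # P[Dz | chromosome carries A1]
w2 = f12 * pa + f22 * (1 - pa)          # P[Dz | chromosome carries A2]
K = pa * w1 + (1 - pa) * w2             # prevalence under HWE
affected = (p11 * w1 / K, p12 * w1 / K, p21 * w2 / K, p22 * w2 / K)

r2_population = r2(p11, p12, p21, p22)
r2_affected = r2(*affected)
print(f"K = {K:.4f}")
print(f"r2 (population) = {r2_population:.4f}")
print(f"r2 (affected)   = {r2_affected:.4f}")
```

The prevalence comes out near 4% and the population r2 near 0.806, as stated in the text, while r2 among affected chromosomes is roughly 0.02.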

One can frame this power-based argument in terms of the ‘fundamental theorem’ describing the relationship between power to detect association indirectly at a marker locus and the r2 value between the marker and a disease-susceptibility locus. More precisely, it states that if a certain sample size is required at a disease locus to detect disease association at a given level of power, the sample size must be increased by a factor of 1/r2 to obtain the same power indirectly at a marker locus. This simple relationship is described in Lai et al15 and Pritchard and Przeworski.16 This relationship is a good rule of thumb, but in a case–control setting, deviations caused by disease models can be substantial, as pointed out recently by Terwilliger and Hiekkalinna.6 In most realistic instances, the underlying disease model shifts the sample size given by the ‘fundamental theorem’ by between −10 and +10%. That is, the sample size estimated to be required to detect association at a particular power level is over- or underestimated by approximately 10%. Hence, if the only assumption of the fundamental theorem that is violated, among the several mentioned by Terwilliger and Hiekkalinna,6 is the independence of etiology and LD patterns, our conclusions concerning the inaccuracies of the fundamental theorem are less extreme than those put forth by Terwilliger and Hiekkalinna. See Figure 2 for an evaluation of a number of these likely more realistic models. In situations where $\tilde{r}^2 < r^2$, the fundamental theorem underestimates the sample size needed to detect disease association at a marker in LD with a disease-susceptibility site. Conversely, the fundamental theorem overestimates the sample size when $\tilde{r}^2 > r^2$.

Figure 2

The error in the number of samples required to obtain a given power threshold to detect disease association, as estimated by the fundamental theorem, is explored for various disease models. LD is measured in terms of the r2 measure. In each case, the percentage error increases with decreasing pairwise LD. Typical error rates are in the range of ±10% across much of the parameter space. Minor allele frequencies are around 10% in the general population across all models at both loci. Diploid cases and controls (500/500) and a significance level of 0.05 are assumed across all plots. (a) An underdominant model with penetrances of f11=0.08, f12=0.04, f22=0.08. (b) A general dominant model with penetrances of f11=0.08, f12=0.08, f22=0.04. (c) Results under a general recessive model with penetrances of f11=0.20, f12=0.02, f22=0.02. Lastly, (d) shows an additive model with penetrances of f11=0.05, f12=0.10, f22=0.15. Power to detect disease association at the marker locus is displayed with the error percentage.

Sampling properties given general population haplotype frequencies

To this point, our analysis has been restricted to properties in an infinite population. However, haplotype sampling properties are of particular interest, as data invariably are in the form of a sample of chromosomes from the general population. Define xij as the number of copies of the AiBj haplotype in a sample of n chromosomes. Sampling of haplotypes will be based on the modified haplotype probabilities derived above (equations (2)–(5)). Assuming a very large general population, the joint probability of the number of copies of each of the four haplotypes is the multinomial density

$$P[x_{11},x_{12},x_{21},x_{22}] = \frac{n!}{x_{11}!\,x_{12}!\,x_{21}!\,x_{22}!}\prod_{i,j}\tilde{p}_{ij}^{\,x_{ij}}, \qquad \sum_{i,j}x_{ij}=n \tag{20}$$

where $\tilde{p}_{ij}$ is the frequency of the AiBj haplotype among affected individuals.

Because, for any finite n, the sample allele frequencies satisfy P[p1•=0]>0 and P[p•1=0]>0, we define the r2 statistic in either of these cases as the limit of r2 as an allele goes to fixation in the sample. It is readily shown that this limit is zero at either locus. Hence, we set r2=0 in situations where there are no copies of one of the alleles (at either locus) in the sample.

Following equation (20), multinomial-distributed haplotype counts were generated via computer simulations. The aim of these simulations was to better understand the sampling properties of the r2 statistic under disease models and in the general population. The results of these simulations are presented in Figures 3a and b, showing the 2.5th and 97.5th quantiles as well as the mean value of r2. A recessive model and an additive model were explored in Figures 3a and b, respectively. Not surprisingly, in both models the sampling variance of $\tilde{r}^2$ is smaller than that of r2. This is due to the selective sampling of chromosomes from affected individuals, which, under many models, are more genetically homogeneous than a random sample from the general population. Hence, these sampling effects lead to reduced sampling variance.
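A simulation of this kind can be sketched with the standard library alone. The code below draws multinomial haplotype counts, applies the convention r2 = 0 when an allele is absent from the sample, and reports the mean with the 2.5th and 97.5th quantiles. The function names, the seed and the simple order-statistic quantile are choices made for illustration, not the authors' implementation.

```python
import random


def r2_from_counts(x11: int, x12: int, x21: int, x22: int) -> float:
    """Sample r2 from haplotype counts; 0 if either locus is monomorphic."""
    n = x11 + x12 + x21 + x22
    a, b = x11 + x12, x11 + x21          # counts of alleles A1 and B1
    if a in (0, n) or b in (0, n):
        return 0.0
    return (x11 * x22 - x12 * x21) ** 2 / (a * (n - a) * b * (n - b))


def simulate_r2(probs, n: int, reps: int, seed: int = 0):
    """Return (mean, 2.5th quantile, 97.5th quantile) of sample r2."""
    rng = random.Random(seed)
    values = []
    for _ in range(reps):
        draws = rng.choices(range(4), weights=probs, k=n)   # one multinomial draw
        counts = [draws.count(h) for h in range(4)]
        values.append(r2_from_counts(*counts))
    values.sort()
    lo = values[int(0.025 * (reps - 1))]
    hi = values[int(0.975 * (reps - 1))]
    return sum(values) / reps, lo, hi


# General population haplotype frequencies from the Figure 3 caption.
print(simulate_r2((0.10, 0.05, 0.05, 0.80), n=100, reps=2000))
```

To study a disease model, the same function can be called with haplotype frequencies for affected individuals in place of the general population values.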

Figure 3

(a) Produced from 10 000 replicates per data point; the no-disease model has equal penetrances; recessive disease model parameters are f11=0.10, f12=f22=0.001; the general population haplotype frequencies are p11=0.10, p12=0.05, p21=0.05, p22=0.80. (b) Produced from 10 000 replicates per data point; the no-disease model has equal penetrances; additive disease model parameters are f11=0.09, f12=0.05, f22=0.01; the general population haplotype frequencies are p11=0.10, p12=0.05, p21=0.05, p22=0.80.

There are many uses of the variance of different LD measures. For example, the LD contrast method of Nielsen et al11 uses Var[D] in the test statistic. Weir17 solves for this quantity,

Substitution of allele frequency and D values from disease-affected samples enables the calculation to be made for Var[$\tilde{D}$]. Figure 4 shows results comparing the variance in D between general population samples and samples selected on the basis of disease phenotype. One hundred diploid individuals were used in the calculations. Three different disease models were evaluated: recessive, dominant and additive, and results are presented as a function of relative risk. These results show only a mild to moderate departure from the variance in the general population samples, with the variance in disease samples being approximately 10–15% lower at the more extreme relative risks.
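For orientation, the sketch below evaluates a widely used large-sample approximation to the variance of the estimated D for phased haplotype data. Whether this is the exact expression taken from Weir in this paper is an assumption, as is the use of the number of sampled haplotypes as the divisor.

```python
def var_d_hat(pa: float, pb: float, d: float, n_haplotypes: int) -> float:
    """Approximate sampling variance of the estimated D (assumed form).

    pa and pb are the allele frequencies at the two loci, d is the
    disequilibrium coefficient and n_haplotypes is the number of sampled
    chromosomes.
    """
    if n_haplotypes <= 0:
        raise ValueError("n_haplotypes must be positive")
    numerator = pa * (1 - pa) * pb * (1 - pb) + (1 - 2 * pa) * (1 - 2 * pb) * d - d * d
    return numerator / n_haplotypes


# 100 diploid individuals correspond to 200 haplotypes.
print(var_d_hat(0.30, 0.30, 0.05, 200))
```

Substituting the affected allele frequencies and the affected value of D for `pa`, `pb` and `d` gives the corresponding variance for a disease-selected sample of the same size.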

Figure 4

A comparison between the sample variance in D under three modes of inheritance (recessive, dominant and additive) and the sample variance for general population samples. One hundred diploid individuals were sampled. r2=0.27 in the general population. A minor allele frequency of 30% at both sites was modeled.

Sampling properties under the neutral coalescent

In the previous section, we explored the sampling properties of haplotypes preferentially or nonpreferentially sampled from the general population in accord with a disease model. In those simulations, the unselected haplotype frequencies were given. One may also be interested in the situation where those general population haplotype frequencies are randomized. A simple and flexible method to do so is to generate the general population haplotype frequencies under a Wright–Fisher model using a large-sample neutral coalescent with recombination.18 Although analytic approximations for population-based two-locus models exist, extensions to more complicated demographic models are much more straightforward under a coalescent simulation. The large-scale coalescent-generated haplotypes constitute the general population from which disease haplotypes are sampled according to penetrances. In these simulations, 5000 two-locus chromosomes were generated, from which 100 chromosomes were sampled using probabilities proportional to the disease haplotype probabilities. Four different disease models were explored: dominant, recessive, underdominant and additive modes of inheritance. r2 and $\tilde{r}^2$ were calculated for the general population and disease population samples. The mean and the 0.025 and 0.975 quantiles for both correlation statistics are reported in Table 2 below for both the general population and the 100 disease haplotypes. Under most replicates, the A1B1 haplotype is the most frequent. Table 2 summarizes this simulation study. When compared to the analytic results, these neutral coalescent results appear to corroborate the general patterns of LD, with the distribution of $\tilde{r}^2$ being shifted from that of r2, the largest reductions being found in recessive and underdominant models and the largest inflations for dominant and additive modes of inheritance. Over the models examined, the 97.5th quantile varies roughly by a factor of 4. The 95% confidence interval under the high-frequency recessive model is approximately half the value in the general population (which is close to the sampling distribution value averaged across models), and the dominant model exhibits slightly greater than a twofold increase in the 95% confidence interval over the general population.
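The sampling step of this simulation can be sketched as follows. The coalescent simulation itself is not reproduced: the population is simply supplied as a list of two-locus haplotypes. Sampling with replacement, the 1/2 allele coding and the function name are assumptions made for illustration.

```python
import random


def sample_affected(population, f11, f12, f22, n, rng):
    """Draw n haplotypes with probability proportional to disease risk.

    population is a list of (a, b) tuples with alleles coded 1 or 2.
    Each haplotype is weighted by the average penetrance of a chromosome
    carrying its A allele, assuming HWE in the supplied population.
    """
    pa = sum(1 for a, _ in population if a == 1) / len(population)
    weight = {
        1: f11 * pa + f12 * (1 - pa),   # chromosome carries A1
        2: f12 * pa + f22 * (1 - pa),   # chromosome carries A2
    }
    weights = [weight[a] for a, _ in population]
    return rng.choices(population, weights=weights, k=n)


# Toy population standing in for 5000 coalescent-generated chromosomes.
toy = [(1, 1)] * 1500 + [(1, 2)] * 500 + [(2, 1)] * 500 + [(2, 2)] * 2500
cases = sample_affected(toy, 0.10, 0.001, 0.001, 100, random.Random(1))
print(sum(1 for a, _ in cases if a == 1) / len(cases))
```

The sampled haplotypes can then be tabulated into counts and passed to any r2 estimator to obtain the statistic for the disease sample.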

Table 2 Disease sampling from neutral coalescent-generated haplotypes

Discussion

In this paper, we have explored the effect of disease models on pairwise measures of LD. Analytic work was able to delineate regions of the parameter space where LD in affected individuals lies above or below that of the general population. Often, the disease population exhibits higher or similar levels of LD when compared to the general population. As the affected population is selected based on the presence of an ancestral segment of DNA harboring the predisposing variant, this is the most intuitive scenario. However, all underdominant, some dominant and recessive, and protective models are capable of generating LD values substantially below those observed in the general population. This has important ramifications for disease mapping using LD-based methods. For example, a common methodology is to select a set of markers based on the observed LD in a region in the general population. The statistical power for a given sample size can then be estimated. However, if the LD in the affected population is lower than that in the general population, statistical power may be greatly overestimated. This results from the correlation between the associated allele at the marker and the predisposing allele being lower than expected based on the observations in the general population. Because of this, the allele frequency in the affected population will be less strongly influenced by the proximity of the predisposing locus, and hence the difference in allele frequency between cases and controls may be less than expected under the assumption of equal LD in both populations. In this way, regions of the genome may be poorly interrogated. A possible way to alleviate these false negatives is to select a set of markers that potentially provides high levels of LD under a variety of plausible models.

Sets of tagging SNPs for mapping studies are often selected to minimize the number of SNPs assayed. That is, a single SNP may act for a number of other markers for which LD is high. As most tagging SNP programs are designed for and applied to randomly selected chromosomes from presumably unaffected individuals for use in disease association studies, incorporation of these effects may increase the efficacy of such tagging SNP procedures. This is particularly true in instances where the level of LD differs substantially between affected and unaffected individuals or where LD levels are markedly reduced in affected individuals when compared to chromosomes randomly drawn from the general population. Sets of markers based on these assumptions may be inadequate for situations where LD is greater in the affected population. For example, as we have shown, ignoring LD differences between cases and controls can produce nonoptimal power calculations.

Hopefully, this work will motivate further exploration of the impact of different LD patterns on association studies. For example, it may be possible to develop new statistical tests, based on both the patterns of LD and haplotype frequency differences between case and control samples, that provide high power to detect disease-predisposing regions. In any scenario where there is strong selection bias, be it positive or negative as in the case of a disease, LD may differ. Detecting such signatures of selection bias will undoubtedly add to our understanding of human disease etiology.