Introduction

Genome-wide association studies have been very successful at identifying loci involved in rare Mendelian traits in population isolates, and as such have been suggested in recent years to be potentially useful for dissection of the etiology of complex traits as well. With this aim in mind, the International HapMap project has been developing a dense genome-wide map of single-nucleotide polymorphisms (SNPs) and characterizing the linkage disequilibrium (LD) among them. The goal is ultimately to allow scientists to select subsets of the SNPs, which are in strong enough LD with the untyped SNPs to allow them to serve as useful surrogates (i.e., to reduce dimensionality of a genome scan by selecting a maximal set of markers showing a minimal amount of LD among themselves). This is the essence of linkage analysis as well – in which a relatively sparse marker map is used to infer the inheritance vectors in families at every genomic position with reasonable and predictable certainty. There are, however, many ways in which application of the correlations owing to LD differs quantitatively and qualitatively from those due to linkage. LD and linkage are different ways of assessing essentially the same phenomenon, as LD exists only when copies of a given molecular variant shared by two individuals are clonal copies of the same ancestral alleles (identical-by-descent (IBD) in the population), and markers nearby are shared as well because of a paucity of recombination between the loci historically – that is, linkage (see Terwilliger1, 2 for a review of this relationship).

In genome-wide mapping studies, one does not presume to know what the etiological architecture of the trait under study is in truth. However, study design and analysis methods are predicated on sets of assumptions, because power under different approaches can only be compared under some mathematically tractable models of ‘truth’. Association studies are often argued to be more powerful than linkage studies for various reasons. It is rather obvious that if you measure a functional polymorphism directly, it will never be less correlated to the trait than a marker that is both linked to and in LD with the functional site. Furthermore, it is similarly obvious that such a marker can never be less correlated to the trait than a marker that is linked to, but not in LD with the functional site. But this does not mean that linkage analysis is less powerful than association analysis, and says nothing about whether one would have more power studying families or unrelated individuals, although it is clear that having access to a well-characterized dense map of markers across the genome and understanding their LD relationships could be potentially useful. But the question of how valuable such an approach will be is a function of many parameters describing the assumed etiological models, not to mention the study design employed. In order for association studies to work, one needs the phenotype being studied to predict the genotypes of the locus to be identified to a reasonable degree – that is, to have high ‘detectance’.3, 4 Furthermore, it is necessary for there to be LD between the functional variant(s) and at least one allele of one of the markers being studied. And finally, one often further assumes that the marker genotypes are statistically independent of the trait, conditional on the genotypes of the functional site in question.

The substantial debate in the literature about the prognosis of genome-wide association studies for mapping genes involved in multifactorial traits to date has focused on the first of those issues – whether or not significantly high detectance is expected for such traits, and whether or not the subset of SNPs selected for analysis will be in sufficient LD with the functional genotypes predicted by the phenotype of interest. These arguments focus on issues related to the common-variant/common-disease (CVCD) hypothesis,1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 the quality and quantity of LD2, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40 and the effects of different population and study design/ascertainment options.3, 8, 19, 38, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50 However, the ramifications of assumptions about the independence of marker genotypes and trait phenotypes conditional on genotypes of the functional variant of primary interest have not been described in much detail, despite the centrality of this assumption. In this paper, we address the critical importance of this issue, and demonstrate numerous reasons why such conditional independence of exposures rarely exists in practice, using a combination of theoretical and empirical arguments. We thus present a further criticism of the rationale for genome-wide association studies based on structural theoretical arguments, independent of the widely known counterarguments outlined above. In fact, for purposes of this manuscript, we assume that there are common risk alleles that are detectable if they were themselves genotyped, and show that markers in high LD with the functional variants may never show evidence of association in infinite sample sizes.

Measures of LD between two SNPs

The coefficient of LD between alleles of two SNPs with alleles (A/a) and (B/b), respectively, is defined as $\delta = D_{AB} = p_{AB} - p_A p_B$,51, 52 where $p_{AB}$ denotes the frequency of the haplotype bearing alleles A and B at the two loci, and $p_A$ and $p_B$ denote the allele frequencies of alleles A and B. Since $D_{AB}$ varies enormously as a function of the allele frequencies, it is common to quantify LD rather in terms of measures which attempt to normalize for the effects of allele frequencies. One such measure is defined as

$$D'_{AB} = \frac{|D_{AB}|}{D_{\max}}, \qquad D_{\max} = \begin{cases} \min(p_A p_b,\; p_a p_B) & \text{if } D_{AB} > 0 \\ \min(p_A p_B,\; p_a p_b) & \text{if } D_{AB} < 0 \end{cases}$$

This measure has been generally preferred by population geneticists, as it has predictable behavior as a function of the recombination fraction between the SNPs and the demographic history of the two polymorphisms in the population. An alternative standardized measure, the correlation coefficient, is often used by statisticians, because it has predictable behavior concerning not the evolutionary relationships among markers, but rather concerning the power to detect the correlation in a sample. This measure is defined as

$$\rho_{AB} = \frac{D_{AB}}{\sqrt{p_A\, p_a\, p_B\, p_b}}$$

The sample estimate of $\rho_{AB}$ is conventionally denoted as $r_{AB}$. It can be shown that the test statistic for the conventional χ2 test of independence on a 2 × 2 table of haplotype frequencies of the alleles of these two loci is numerically equivalent to $X^2 = N r_{AB}^2$, where N denotes the sample size. Leaving aside the philosophical and scientific rationales for preferring one metric over the other, in this paper we focus on the properties of the ρ2 estimates as predictors of the power of an association study, owing to the following factual relationship: If SNP B has a functional relationship to phenotype C, then under simple random sampling, the χ2 statistic relating B and C would be $X^2 = N_{BC}\, r_{BC}^2$, and in order to obtain the same numerical value for a χ2 statistic relating marker A to phenotype C instead of typing marker B would require sample size

$$N_{AC} = N_{BC}\, \frac{r_{BC}^2}{r_{AC}^2}$$
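The definitions above can be checked numerically. The following Python sketch computes D, D′ and r2 from a haplotype frequency and two allele frequencies, and confirms on one 2 × 2 table that the Pearson χ2 statistic equals N r2. The function names and the example counts are illustrative assumptions, not taken from the paper.

```python
from math import sqrt


def ld_measures(p_ab: float, p_a: float, p_b: float) -> dict:
    """Return D, D' and r^2 for alleles A and B.

    p_ab is the frequency of the A-B haplotype; p_a and p_b are the
    frequencies of alleles A and B.
    """
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r = d / sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))
    return {"D": d, "D_prime": d_prime, "r": r, "r2": r * r}


def pearson_chi2(n_ab: int, n_a_b: int, n_a_b2: int, n_ab2: int) -> float:
    """Pearson chi-square for the 2 x 2 haplotype table [[AB, Ab], [aB, ab]]."""
    n = n_ab + n_a_b + n_a_b2 + n_ab2
    row1, row2 = n_ab + n_a_b, n_a_b2 + n_ab2
    col1, col2 = n_ab + n_a_b2, n_a_b + n_ab2
    cells = ((n_ab, row1, col1), (n_a_b, row1, col2),
             (n_a_b2, row2, col1), (n_ab2, row2, col2))
    return sum((obs - r * c / n) ** 2 / (r * c / n) for obs, r, c in cells)


if __name__ == "__main__":
    # Hypothetical counts: AB=30, Ab=10, aB=20, ab=40 (N=100 haplotypes).
    m = ld_measures(p_ab=0.30, p_a=0.40, p_b=0.50)
    print(m)
    print(pearson_chi2(30, 10, 20, 40), 100 * m["r2"])
```

The two printed values on the last line agree, which is the identity X2 = N r2 that the sample-size argument relies on.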

Underestimation of sample size requirements due to upward bias in LD estimates from small samples

The so-called haplotype map (or HapMap) of SNPs spanning the genome was designed to facilitate genome-wide association analysis, based on extensions of the relationship described above relating test statistics and correlation coefficients. In a recent paper, Gabriel et al53 claimed that ‘… the average maximal r2 value between each additional SNP and the haplotype framework was high, ranging from 0.67 to 0.87 in the four population samples. That is, for the average untested marker, only a small increase in sample size (15–50%) would be needed for the use of a haplotype-based (as compared to direct) association study.’53

It is well known that these measures of LD are strongly biased in an upward direction in small samples (with the D′ metric being more strongly biased than r2), because the measures are both defined to be nonnegative.37 While the bias in the measure r2 is often not large in magnitude, the effects of the bias on estimates of 1/r2, which is claimed to be linearly related to the required sample size and thus to the power of association studies can be enormous even with samples much larger than those being used in the International HapMap project to characterize LD across the genome.53 Figure 1 shows graphically the magnitude of the bias for a variety of sample sizes, based on simulation of 1 000 000 data sets, in which the allele frequency for the rare allele at each locus was 0.1. In this figure, the x-axis shows the true value of 1/ρ2 with the y-axis showing the expected value of 1/r2, which is theorized by Gabriel et al53 and others to be a measure of the increased sample size needed when replacing a functional variant in an analysis by a tag SNP in LD with it.54 In each graph, if the estimates were unbiased, Figure 1 would contain only the straight line x=y. It is apparent that 1/r2 is estimated to be much lower than 1/ρ2, and thus provides a gross underestimate of the needed sample size. But this is just the tip of the iceberg when it comes to problems in the theory, which we now examine in detail.

Figure 1

A total of 1 000 000 replicates were simulated of data sets of varying size, from 10 to 100 samples, to demonstrate the small sample bias in estimates of the squared correlation coefficient and its reciprocal. This figure shows the bias in 1/r2 as an estimator of 1/ρ2 for sample sizes 10, 20, 50, and 100 in order from the lowermost to uppermost curves in the figure. The straight line x=y would represent an absence of bias in the estimators. Note that 1/r2 is used as an estimator of the multiplier for the sample size needed for equivalent power using marker A instead of functional variant B in an association study according to the ‘Fundamental Theorem of the HapMap’. In this graph, the line x=y would represent the theoretical predictions, and the other curves show the effects of underestimation of this term owing to small sample bias. For purposes of this figure, both loci have minor allele frequency of 0.1, the best-case scenario expected for LD mapping with SNPs.
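The direction of the small-sample bias can be illustrated with a short simulation. This is a minimal Python sketch, not the authors' simulation: it draws haplotypes from a multinomial distribution, discards samples in which either locus is monomorphic (an assumption made here because r2 is then undefined), and averages r2 itself rather than 1/r2. With true ρ = 0, any positive mean is pure bias.

```python
import random
from math import sqrt
from typing import Optional


def sample_r2(n: int, p_a: float, p_b: float, rho: float,
              rng: random.Random) -> Optional[float]:
    """Draw n haplotypes and return the sample r^2.

    Returns None when either locus is monomorphic in the sample.
    """
    d = rho * sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))
    # Haplotype frequencies in the order AB, Ab, aB, ab.
    freqs = [p_a * p_b + d, p_a * (1 - p_b) - d,
             (1 - p_a) * p_b - d, (1 - p_a) * (1 - p_b) + d]
    counts = [0, 0, 0, 0]
    for h in rng.choices(range(4), weights=freqs, k=n):
        counts[h] += 1
    n_a = counts[0] + counts[1]
    n_b = counts[0] + counts[2]
    if n_a in (0, n) or n_b in (0, n):
        return None
    fa, fb = n_a / n, n_b / n
    d_hat = counts[0] / n - fa * fb
    return d_hat ** 2 / (fa * (1 - fa) * fb * (1 - fb))


def mean_r2(n: int, p_a: float, p_b: float, rho: float,
            reps: int, seed: int = 1) -> float:
    """Average sample r^2 over replicates with both loci polymorphic."""
    rng = random.Random(seed)
    values = []
    for _ in range(reps):
        r2 = sample_r2(n, p_a, p_b, rho, rng)
        if r2 is not None:
            values.append(r2)
    return sum(values) / len(values)


if __name__ == "__main__":
    # Minor allele frequency 0.1 at both loci, as in Figure 1; true rho = 0.
    for n in (10, 20, 50, 100):
        print(n, round(mean_r2(n, 0.1, 0.1, 0.0, reps=5000), 4))
```

The mean r2 shrinks toward zero as the number of haplotypes grows, so a reciprocal built on the small-sample estimate understates 1/ρ2.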

Sample size for genome-wide linkage analysis

The reason why genome-wide linkage analysis has been successful in reducing the dimensionality of a genome scan is that the correlations in inheritance due to linkage are strictly a function of meiotic recombination frequencies, which have highly regular behavior, as reflected by the existence of mapping functions. That is to say that in a linkage analysis, one measures cosegregation of loci in families, which is solely a function of the genetic distance between the loci. If there are three loci, X, Y, and Z, and a recombination event occurs between X and Y, and no recombination occurs between Y and Z, then the alleles of loci X and Z must be likewise recombinant in that meiosis, whether or not they are syntenic. Such simple deterministic relationships make linkage mapping mathematically tractable. It can be shown that, in general, for an ordered set of marker loci, XYZ, $\theta_{XZ} = \theta_{XY} + \theta_{YZ} - 2c\,\theta_{XY}\theta_{YZ}$, where $\theta_{XY}$ is the probability of recombination between loci X and Y, and where c is the ‘coefficient of coincidence’, a measure of the strength of crossover ‘interference’, or nonindependence of recombination in adjacent intervals. For the most part, while there is evidence of weak interference in real data, most statistical analyses assume that c=1 (no interference). This is a close approximation to the truth over large distances, and on small distances, where these measures are most relevant in linkage analysis, interference has little or no effect on the analysis outcomes, because c is strongly bounded with the following equation:51

Let us assume that we have two markers, X and Y, as above, whose positions are known, and we want to compute the value of some statistic relating each of those to some disease locus, Z. Now, let us make the further assumption that the loci are actually in the order XYZ, and that we know $\theta_{XY}$ and $\theta_{YZ}$. Looking at meioses from a parent with phased genotype ($1_X 1_Y 1_Z / 2_X 2_Y 2_Z$), we can express the correlation coefficient between the inheritance of the alleles at loci X and Y as

$$\rho_{XY} = 1 - 2\theta_{XY}$$

such that

$$\rho_{XZ} = 1 - 2\theta_{XZ} = 1 - 2\theta_{XY} - 2\theta_{YZ} + 4c\,\theta_{XY}\theta_{YZ}$$

Note that when we assume the absence of interference (c=1), this implies that $\rho_{XZ} = \rho_{XY}\rho_{YZ}$. Thus, if one wanted to use marker X as a surrogate for marker Y in a linkage analysis, the sample size increase needed would be, following the theory above,

$$\frac{N_{XZ}}{N_{YZ}} = \frac{\rho_{YZ}^2}{\rho_{XZ}^2} = \frac{1}{\rho_{XY}^2} = \frac{1}{(1 - 2\theta_{XY})^2}$$

Thus, the correlations among loci owing to linkage are sufficiently strong to allow for significant reduction in the number of positions across the genome at which one needs to examine chromosomal segregation in families, and thus reduction in dimensionality in genome scans.
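The multiplicative behavior of linkage correlations is easy to verify numerically. The sketch below assumes the standard relationship ρ = 1 − 2θ between the recombination fraction and the correlation in inheritance at two loci, together with the three-locus formula for θXZ given earlier; the function names are illustrative.

```python
def theta_xz(theta_xy: float, theta_yz: float, c: float = 1.0) -> float:
    """Recombination fraction between flanking loci X and Z (order X-Y-Z).

    c is the coefficient of coincidence; c = 1 means no interference.
    """
    return theta_xy + theta_yz - 2.0 * c * theta_xy * theta_yz


def rho(theta: float) -> float:
    """Correlation in inheritance between two loci (assumed rho = 1 - 2*theta)."""
    return 1.0 - 2.0 * theta


def relative_sample_size(theta_xy: float) -> float:
    """Sample size multiplier when marker X replaces marker Y, with c = 1."""
    return 1.0 / rho(theta_xy) ** 2


if __name__ == "__main__":
    t_xy, t_yz = 0.05, 0.10
    # Without interference the correlations multiply exactly.
    print(rho(theta_xz(t_xy, t_yz)), rho(t_xy) * rho(t_yz))
    # With interference (c < 1) the product is only an approximation.
    print(rho(theta_xz(t_xy, t_yz, c=0.5)), rho(t_xy) * rho(t_yz))
    print(relative_sample_size(t_xy))
```

For a marker 5% recombination away from the position of interest, the multiplier is 1/0.81, or about 1.23, and it depends only on the genetic distance.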

In contrast, in the case of LD, the correlation coefficients are not multiplicative. To illustrate the lack of multiplicativity of r2 estimates for SNP markers, we used all the pairwise r2 estimates from the International HapMap Project,5 release date 16 June 2005. In this analysis, we considered all triples of markers XYZ, and in Figure 2, on the horizontal axis we give the reported estimate for r2XZ, and on the vertical axis we give the product r2XYr2YZ for all triples of markers. If the correlation coefficients were multiplicative, as they are for linked marker loci in linkage analysis, the graph would be basically a straight diagonal line through the origin (x=y). As you can clearly see, however, this is not the case at all. Rather, there is precious little information about the correlation coefficient between X and Z, which can be gleaned from knowing the values of the correlation coefficients between markers X and Y, and that between markers Y and Z. Thus, the theory described below, which drives the HapMap project clearly does not hold in general for correlation coefficients, and in fact can be grossly misleading, most strikingly so when there is substantial LD – the very situation HapMap is designed to model.

Figure 2

Non-multiplicativity of r2 estimates: r2 estimates reported in the 16 June 2005 release from the International HapMap Consortium, based on CEPH (CEU), CHB, JPT and YRI data sets are demonstrated to be nonmultiplicative. All triples of SNPs were considered: X–Y–Z, from all autosomal chromosomes and on the x-axis is the estimate of r2XZ, and on the y-axis is the product of the estimators of r2XY and r2YZ. If correlation coefficients were multiplicative, the graph should be the straight line x=y, but as can be clearly seen, this is not the case, and the values of r2XY and r2YZ can be seen to provide very little information about the correlation coefficient between the flanking markers X and Z, which would not be the case if the assumptions of the Fundamental Theorem of the HapMap held in general.

Fundamental theorem of the HapMap

Statements about the relationships between LD and power of association studies like those made in the Gabriel et al53 paper are based on theory, which assumes a multiplicative relationship among estimated correlation coefficients for different factors,32 although it is well known that correlation coefficients are not generally multiplicative. For example, Czechs have higher alcohol consumption than Finns, and men have higher alcohol consumption than women.55 If the correlation coefficients describing these relationships were multiplicative, then one would arrive at the false conclusion that this implies that being male was correlated with being Czech. Justification for moving forward with HapMap as a tool for genome-wide association studies has been based on extrapolations from the aforementioned theory relating χ2 statistics to correlation coefficients. Let us define this hypothesized relationship formally as follows:

Theorem (‘Fundamental Theorem of The HapMap’): If ρAB is the correlation coefficient between alleles of two SNPs, A (with alternate allele a) and B (with alternate allele b), and if sample size NBC would be sufficiently large to detect a correlation between phenotype C (with alternate phenotype c) and functional allele B, then the sample size, NAC, needed to detect a correlation between nonfunctional allele A and the same phenotype would be NAC=NBC/ρAB2.

However, remember that earlier we demonstrated that in general, $N_{AC} = N_{BC}\, r_{BC}^2 / r_{AC}^2$, such that the implicit assumption is that $1/\rho_{AB}^2 = r_{BC}^2 / r_{AC}^2$, or in other words, that $\rho_{AB}^2\, r_{BC}^2 = r_{AC}^2$. This is analogous to the highly deterministic relationship among linked marker loci in the case of linkage analysis, $\rho_{XZ} = \rho_{XY}\rho_{YZ}$, which holds in the absence of crossover interference, as shown above. Note that this multiplicative relationship only holds because of the great regularity of the correlations generated by the recombination process in meiosis. The generalization of this relationship to LD studies (and the theory of correlation coefficients in general) is far from straightforward and requires strong additional assumptions for it to hold, not to mention large samples, as r2 is an upwardly biased estimator of the squared correlation coefficient, as shown above.

It can be shown in general (see Appendix A) that the relationship ρAC=ρABρBC implies the independence of A and C conditional on B, that is to say P(A|BC)=P(A|Bc)=P(A|B), and so forth. In the context of association studies, the Fundamental Theorem of the HapMap implies that the frequency of SNP allele A conditional on allele B at the functional site is invariant between cases and controls, implying that the only reason allele A might differ in frequency between cases and controls would be because of differences in the frequency of allele B. For purposes of the following discussion, we ignore the effects of ploidy, to simplify the algebra and to avoid the need to specify particular dominance relationships. In the context of this discussion, we refer to A and B as alleles on a haplotype drawn at random from an individual, such that P(C|B) refers to the probability that the person from whom a B allele is selected at random is affected. Note that in terms of a penetrance model, P(C|B)={P(C|BB)P(BB)+0.5P(C|Bb)P(Bb)}/P(B), which is exactly what would be compared in a cohort study comparing allele frequencies with a dichotomous phenotypic outcome.
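The link between conditional independence and multiplicativity can be illustrated with two small joint distributions over binary A, B and C. In this Python sketch the first model makes the phenotype depend on B only, and the second lets the risk on a B-bearing haplotype differ by the allele at A. All penetrance values are hypothetical and chosen for illustration only.

```python
from itertools import product
from math import sqrt


def joint_from_conditionals(p_b: float, p_a_given: dict, p_c_given: dict) -> dict:
    """Joint distribution over (A, B, C), each coded 0/1.

    p_a_given[b]      = P(A=1 | B=b)
    p_c_given[(a, b)] = P(C=1 | A=a, B=b)
    """
    joint = {}
    for a, b, c in product((0, 1), repeat=3):
        pb = p_b if b else 1 - p_b
        pa = p_a_given[b] if a else 1 - p_a_given[b]
        pc = p_c_given[(a, b)] if c else 1 - p_c_given[(a, b)]
        joint[(a, b, c)] = pb * pa * pc
    return joint


def corr(joint: dict, i: int, j: int) -> float:
    """Correlation coefficient between binary variables at positions i and j."""
    pi = sum(p for k, p in joint.items() if k[i])
    pj = sum(p for k, p in joint.items() if k[j])
    pij = sum(p for k, p in joint.items() if k[i] and k[j])
    return (pij - pi * pj) / sqrt(pi * (1 - pi) * pj * (1 - pj))


# P(A)=P(B)=0.1 with rho_AB = 0.9, as in the example used later in the text.
P_A_GIVEN = {1: 0.91, 0: 0.01}

# Model 1: phenotype depends on B only (A and C independent given B).
independent = joint_from_conditionals(
    0.1, P_A_GIVEN, {(1, 1): 0.05, (0, 1): 0.05, (1, 0): 0.02, (0, 0): 0.02})

# Model 2: risk on B haplotypes differs by the allele at A (hypothetical).
dependent = joint_from_conditionals(
    0.1, P_A_GIVEN, {(1, 1): 0.04, (0, 1): 0.15, (1, 0): 0.02, (0, 0): 0.02})

if __name__ == "__main__":
    for name, j in (("independent", independent), ("dependent", dependent)):
        print(name, corr(j, 0, 2), corr(j, 0, 1) * corr(j, 1, 2))
```

In the first model ρAC equals ρABρBC to rounding error; in the second the two quantities diverge although ρAB is still 0.9.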

Bounds on unconditional and conditional ρAC given B or b

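Because all pairwise haplotype frequencies are probabilities constrained to be between 0 and 1, there are algebraic restrictions on the range of ρAC as a function of P(A) and P(C), as follows:

$$\frac{\max\{-P(A)P(C),\; -P(a)P(c)\}}{\sqrt{P(A)P(a)P(C)P(c)}} \;\le\; \rho_{AC} \;\le\; \frac{\min\{P(A)P(c),\; P(a)P(C)\}}{\sqrt{P(A)P(a)P(C)P(c)}}$$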

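Furthermore, if we know ρAB and ρBC, P(A), P(B), and P(C), then we also know uniquely P(AB), P(Ab), P(CB), and P(Cb) as

$$P(AB) = P(A)P(B) + \rho_{AB}\sqrt{P(A)P(a)P(B)P(b)}$$

and

$$P(Ab) = P(A)P(b) - \rho_{AB}\sqrt{P(A)P(a)P(B)P(b)}$$

and

$$P(CB) = P(C)P(B) + \rho_{BC}\sqrt{P(B)P(b)P(C)P(c)}$$

and

$$P(Cb) = P(C)P(b) - \rho_{BC}\sqrt{P(B)P(b)P(C)P(c)}$$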

These restrictions imply that the restrictions on the conditional values of ρAC given B and b are different, since

This implies that there are additional restrictions on the unconditional values of ρAC, and thus ρAB and ρBC contain information about ρAC, even though they do not determine it uniquely.

By definition,

and this quantity has its lower bound when

and its upper bound when

To see what these constraints mean about the information about tertiary correlations contained within sets of pairwise correlation coefficients, consider the following set of assumptions about marker and disease loci, which are best-case scenarios for association studies: P(A)=0.1 (tag SNP with rare allele frequency of 10%), P(B)=0.1 (functional variant with rare allele frequency of 10%), P(C)=0.025 (disease prevalence of 2.5%), ρAB=0.9 (r2=0.81 between the rare alleles of the functional variant and the tag SNP). Translating the constraints described above into parameters that are more comprehensible to the genetic epidemiologist concerned about practical ramifications, we graph the bounds on r2AC in Figure 3a, as a function of the relative risk of allele B, which is defined as

$$RR_B = \frac{P(C \mid B)}{P(C \mid b)}$$

Figure 3

Bounds on r2AC are shown graphically for simple random sampling. In this case, the allele frequencies for minor alleles of both functional locus B and tag SNP A are set to 0.1 and the correlation coefficient between them, ρAB=0.9. (a) Upper and lower bounds on r2AC as a function of the relative risk of the functional variant B on phenotype C, with the curve in the middle representing the theoretical prediction under multiplicativity of correlation coefficients. (b) The same bounds but with the y-axis representing ρBC2/ρAC2, which is the increase in sample size actually needed when typing SNP A instead of functional site B. Note that in (b), the theoretical prediction is that this ratio should be 1.2 for all values of the relative risk.

The horizontal bar on top is the upper bound on r2AC imposed by P(A) and P(C) alone, the upper curve is the upper bound as a function of RRB ranging from 0.1 to 10, and the lower curve is the predicted value of r2AC from Gabriel et al.53 Note that throughout this range, the lower bound on r2AC is 0, that is, it is possible to have no correlation whatsoever between the phenotype and the tag SNP, even with correlation coefficient of 0.9 between the SNP and the functional polymorphism!

Since it is r2AC that is directly related to power, and since it was indicated above that NAC r2AC=χ2 for a given data set, then if we wanted to obtain the same significance from an association study using SNP A instead of functional variant B, to satisfy the relationship NACrAC2=NBCrBC2, the relative sample size needed using the surrogate tag SNP would be N=NAC/NBC=rBC2/rAC2. Figure 3b graphs the upper bound on this relative sample size over the range of RRB extending from 1 to 100. The upper value is well beyond the plausible range for variants being sought in complex trait association studies, with RRB between 1.5 and 3 being the range most people claim to be interested in. Note that for these fairly reasonable assumptions about the frequencies of the variants, and the high correlation coefficient of 0.9 between the SNP and the functional variant, the sample size has an upper bound of infinity over much of the range considered. Note that the thin horizontal line at N=1.23 is the predicted increase in sample size needed claimed by Gabriel et al53 in their naïve application of the theory of correlation coefficients.
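The range of r2AC compatible with a given ρAB and relative risk can be sketched as follows. This Python example is an illustration under stated assumptions rather than the authors' derivation: it obtains P(AB), P(Ab), P(CB) and P(Cb) from the pairwise parameters, then applies the elementary bounds max(0, x + y − m) ≤ P(joint) ≤ min(x, y) separately within the B and b strata to bound P(AC). The function name and return layout are hypothetical.

```python
from math import sqrt


def r2_ac_bounds(p_a: float, p_b: float, p_c: float,
                 rho_ab: float, rr_b: float) -> tuple:
    """Return (min r2_AC, multiplicative prediction, max r2_AC).

    rr_b is the relative risk P(C|B) / P(C|b).
    """
    def sd(p: float) -> float:
        return sqrt(p * (1 - p))

    # Pairwise frequencies implied by rho_AB and by the relative risk.
    p_ab = p_a * p_b + rho_ab * sd(p_a) * sd(p_b)      # P(A and B)
    p_a_nb = p_a - p_ab                                # P(A and b)
    p_c_given_nb = p_c / (rr_b * p_b + (1 - p_b))      # P(C | b)
    p_bc = rr_b * p_c_given_nb * p_b                   # P(B and C)
    p_nb_c = p_c - p_bc                                # P(b and C)

    # Bounds on P(A and C), stratum by stratum.
    lo = max(0.0, p_ab + p_bc - p_b) + max(0.0, p_a_nb + p_nb_c - (1 - p_b))
    hi = min(p_ab, p_bc) + min(p_a_nb, p_nb_c)

    denom = sd(p_a) * sd(p_c)
    rho_lo = (lo - p_a * p_c) / denom
    rho_hi = (hi - p_a * p_c) / denom
    rho_bc = (p_bc - p_b * p_c) / (sd(p_b) * sd(p_c))

    r2_min = 0.0 if rho_lo <= 0.0 <= rho_hi else min(rho_lo ** 2, rho_hi ** 2)
    r2_max = max(rho_lo ** 2, rho_hi ** 2)
    return r2_min, (rho_ab * rho_bc) ** 2, r2_max


if __name__ == "__main__":
    # Parameters of the worked example in the text.
    for rr in (1.5, 2.0, 3.0, 10.0):
        print(rr, r2_ac_bounds(0.1, 0.1, 0.025, 0.9, rr))
```

Whenever the minimum is 0, the corresponding relative sample size rBC2/rAC2 has no finite upper bound, whatever the multiplicative prediction says.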

For the reader interested in exploring the effects of altered values of the various parameters, this can be done using the Excel spreadsheet found at http://linkage.cpmc.columbia.edu/excel/rsquaredAC.xls.

Ascertainment bias

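Of course, in real-world epidemiological studies, one does not use simple random sampling of the sort for which these mathematical models were derived. In practice, one would ascertain individuals from the population conditional on the trait (outcome C in our nomenclature) because this systematically increases the power by enriching for the rare outcome variable. Mathematically this has the effect of increasing the magnitude of ρBC as follows: since under simple random sampling (r.s.),

$$\rho_{BC}(\mathrm{r.s.}) = \frac{P(BC) - P(B)P(C)}{\sqrt{P(B)P(b)P(C)P(c)}} = \left[P(B \mid C) - P(B \mid c)\right]\sqrt{\frac{P(C)P(c)}{P(B)P(b)}}$$

then if one were to sample from the population conditional on outcome such that the sample had proportion pC of cases and (1–pC) of controls, the value of ρBC under case–control sampling (c.c.) would be

$$\rho_{BC}(\mathrm{c.c.}) = \left[P(B \mid C) - P(B \mid c)\right]\sqrt{\frac{p_C(1 - p_C)}{p_B(1 - p_B)}}$$

where $p_B = P(B \mid C)\,p_C + P(B \mid c)(1 - p_C)$, such that the correlation coefficient is increased by a factor of

$$\frac{\rho_{BC}(\mathrm{c.c.})}{\rho_{BC}(\mathrm{r.s.})} = \sqrt{\frac{p_C(1 - p_C)\,P(B)P(b)}{P(C)P(c)\,p_B(1 - p_B)}}$$

Thus, the sample size needed for an equivalent expected χ2 statistic under case–control sampling with the proportion of cases in the sample set at pC would be

$$N_{BC}(\mathrm{c.c.}) = N_{BC}(\mathrm{r.s.})\,\frac{P(C)P(c)\,p_B(1 - p_B)}{p_C(1 - p_C)\,P(B)P(b)}$$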

A definition of invariant LD among SNPs under case–control sampling in the spirit of the underlying assumptions of the ‘Fundamental Theorem of the HapMap’ would be that P(A|B) and P(A|b) remain invariant with phenotype. Even if this were true, ρAB must be different, since

$$\rho_{AB} = \left[P(A \mid B) - P(A \mid b)\right]\sqrt{\frac{p_B(1 - p_B)}{p_A(1 - p_A)}}$$

While the first part of this equation might be invariant in random sampling or case-control sampling, the second term cannot, since both pB and pA would vary due to ascertainment bias if B were functional.

Ascertainment bias influences the value of NAC, the sample size requirement under case control sampling as follows:

where NAC(c.c.) refers to sample size needed under case control sampling, while NAC(r.s.) refers to the analogous quantity under simple random sampling. While we demonstrated above that the sample size requirement under case–control sampling is reduced whenever P(C)<0.5 in the population, it is not necessarily true that the sample size requirement is decreased when typing SNP A as a surrogate for the functional variant, even when ρAB is high in the population. Figure 4a shows the upper and lower bounds on ρ2AC over the same range given for random sampling in Figure 3a, while Figure 4b shows the same bounds for 1/ρ2AC as a function of the relative risk, as in Figure 3b. Note that while ρ2AC may obtain much higher values under case-control sampling than simple random sampling, the range includes 0 for even high relative risks for B. This means that the sample size requirement (equivalent to that in Figure 3b for random sampling) must include infinity as an upper bound over the entire range. Thus, case–control sampling, theoretically, can lead to even less power than simple random sampling under some models, even when ρAB in the population is as high as 0.9, as it was in the example. It is important to note as well that ρAB in a case–control design cannot be uniquely determined as a function of the population ρAB, the relative risk of disease given functional variant B, and the frequencies of A, B, and C, further complicating predictions about power in that context. Nevertheless, the fact remains that case–control sampling can potentially reduce power over random sampling, when using a tag SNP, A, as a surrogate for some functional variant, B, in an association study with disease C!
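The gain from case–control sampling when the functional variant B itself is typed can be sketched numerically. The Python example below assumes that ρBC in any sample equals the case-control difference in the frequency of B, scaled by the square root of the ratio of the phenotype variance to the allele variance in that sample, which follows from the definition of the correlation coefficient. Function names are illustrative, and the parameters are those of the worked example (P(B)=0.1, P(C)=0.025) with a relative risk of 2.

```python
from math import sqrt


def allele_freq_by_status(p_b: float, p_c: float, rr_b: float) -> tuple:
    """Return (P(B | case), P(B | control)) in the population."""
    p_c_given_nb = p_c / (rr_b * p_b + 1 - p_b)   # P(C | b)
    p_bc = rr_b * p_c_given_nb * p_b              # P(B and C)
    return p_bc / p_c, (p_b - p_bc) / (1 - p_c)


def rho_bc(p_b_case: float, p_b_control: float, frac_cases: float) -> float:
    """Correlation between allele B and phenotype C in a sample with the
    given fraction of cases (the prevalence itself for random sampling)."""
    p_b = p_b_case * frac_cases + p_b_control * (1 - frac_cases)
    return (p_b_case - p_b_control) * sqrt(
        frac_cases * (1 - frac_cases) / (p_b * (1 - p_b)))


def relative_sample_size(p_b: float, p_c: float, rr_b: float,
                         frac_cases: float = 0.5) -> float:
    """N(case-control) / N(random sampling) for equal expected chi-square."""
    case, control = allele_freq_by_status(p_b, p_c, rr_b)
    rs = rho_bc(case, control, p_c)
    cc = rho_bc(case, control, frac_cases)
    return (rs / cc) ** 2


if __name__ == "__main__":
    case, control = allele_freq_by_status(0.1, 0.025, 2.0)
    print(rho_bc(case, control, 0.025), rho_bc(case, control, 0.5))
    print(relative_sample_size(0.1, 0.025, 2.0))
```

This reduction applies to the functional variant B; as the text explains, it does not carry over automatically to a tag SNP A.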

Figure 4

Bounds on r2AC are shown graphically for case-control sampling. (a) is the equivalent graph to Figure 3a for case-control sampling, and (b) is the analog of Figure 3b for case-control sampling. The assumed models are the same as in Figure 3.

Effects of allelic heterogeneity

Let us now examine a very simple situation in which the implications of the ‘Fundamental Theorem of the HapMap’ would be totally misleading. Let us consider three SNPs in a haplotype block, such that two of the three, A and B, are functional (with risk alleles DA and DB, respectively, and normal (wild type) alleles +A and +B, respectively), and have equivalent effect on the trait, in a dominant manner, such that P(Affected|DA/x)=P(Affected|DB/x)=f, P(Affected|+A/+A, +B/+B)=0, and C (a SNP with 2 alleles 1C and 2C) has no phenotypic effect (that is to say that presence of a disease allele at either locus A or locus B on one or both chromosomes gives an individual probability f of being affected, and if neither disease allele is present, the individual is healthy with probability 1). If we assume that all the pairwise D′ values are 1, meaning that there has been no recurrent mutation or recombination historically within this block, there would be four haplotypes with nonzero frequency, for example, H1=P(+A+B 1C); H2=P(+A+B 2C); H3=P(DA+B 1C); H4=P(+A DB 2C). If we were to set all four haplotype frequencies to be equal, H1=H2=H3=H4=0.25, for example, then ρAC2=0.333; ρBC2=0.333, such that the ‘Fundamental Theorem of the HapMap’ would predict that if marker C were used as a surrogate for A in a case-control association test, the sample size needed would be 1/ρAC2=3 times that needed when genotyping A directly.

However, the true detectance distribution for this model is shown in Table 1. There would be power to detect the relationship between either functional variant and the disease, with an odds ratio of 1.55 for the risk allele at each of the disease loci, but the odds ratio is 1 with the SNP that had r2 of 0.333 with each of the disease loci. The ‘Fundamental Theorem of the HapMap’ would have predicted that a sample size three times larger than needed to detect either functional variant would be sufficient to detect the association with the SNP C, but this is clearly untrue. This simple example is admittedly extreme, since both alleles are assumed to be very common, in accordance with the ‘common disease/common variants’ hypothesis, widely touted by the same scientists that are promoting HapMap.8, 21, 56, 57, 58, 59, 60 Nonetheless, this example clearly shows that even with tight haplotype blocks, and common disease alleles, it is possible that functional variants can be detected if they are genotyped in a sample, and yet there might be absolutely no difference between cases and controls whatsoever for other common markers within the same haplotype block.
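The allelic heterogeneity example can be reproduced with a few lines of Python. The sketch below enumerates all pairs of the four haplotypes under random mating and accumulates allele frequencies on chromosomes from cases and controls. The penetrance f = 0.1 is an assumption made here (the value used for Table 1 is not stated in the text); with it the odds ratio at a functional site comes out at about 1.56, close to the value quoted above.

```python
from itertools import product

# Haplotypes coded as (carries D_A, carries D_B, carries 1_C): frequency.
HAPLOTYPES = {
    (0, 0, 1): 0.25,  # H1: +A +B 1C
    (0, 0, 0): 0.25,  # H2: +A +B 2C
    (1, 0, 1): 0.25,  # H3: DA +B 1C
    (0, 1, 0): 0.25,  # H4: +A DB 2C
}


def allele_odds_ratio(haps: dict, locus: int, f: float) -> float:
    """Allelic odds ratio (cases vs controls) at one locus.

    Dominant model: anyone carrying D_A or D_B on either chromosome is
    affected with probability f; non-carriers are never affected.
    """
    case = [0.0, 0.0]     # weight of chromosomes by allele (0/1) in cases
    control = [0.0, 0.0]  # same for controls
    for (h1, p1), (h2, p2) in product(haps.items(), repeat=2):
        carrier = h1[0] or h1[1] or h2[0] or h2[1]
        p_aff = f if carrier else 0.0
        weight = p1 * p2
        for h in (h1, h2):  # each chromosome counts with weight one half
            case[h[locus]] += 0.5 * weight * p_aff
            control[h[locus]] += 0.5 * weight * (1.0 - p_aff)
    return (case[1] / case[0]) / (control[1] / control[0])


if __name__ == "__main__":
    f = 0.1  # hypothetical penetrance
    print("functional site A:", allele_odds_ratio(HAPLOTYPES, 0, f))
    print("functional site B:", allele_odds_ratio(HAPLOTYPES, 1, f))
    print("tag SNP C:        ", allele_odds_ratio(HAPLOTYPES, 2, f))
```

The odds ratio at SNP C is exactly 1 for any choice of f, because the two risk alleles sit on haplotypes carrying opposite alleles of C at equal frequency.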

Table 1 Effects of allelic heterogeneity in a simple case on the multiplicativity assumption

If one allows for more substantial allelic heterogeneity, as is typically seen in most loci that have been studied in sufficient detail, this effect will be exacerbated, because the less frequent the individual variants are, the greater the likelihood that they originated on a variety of haplotypes (according to their population frequencies), so if they fall within the same ‘haplotype block’ they will likely be in opposite phase with any given SNP which is being genotyped. The greater the number of variants, the greater the similarity in the detectance distributions for the haplotypes in cases and controls, if one fails to genotype the functional variants themselves! Furthermore, since it appears likely that in general there is an inverse relationship between effect size and allele frequency, this would further homogenize the distributions of haplotype frequencies for common tag SNPs between cases and controls, making it trivial to construct examples for which there is substantial power to detect the functional variants themselves, if genotyped, in a case-control study, while there would be no power in an infinite sample for tag SNPs, even with very high r2 LD of 0.8. In populations with LD extending over longer distances, the problem becomes more acute, as there are many more loci in LD with any putative marker, any of which might themselves be functional, so while fewer markers would be needed to do a genome-wide association if one chose markers based on the ‘Fundamental Theorem’, there would be many more sites with correlated exposure frequencies that might be potentially functional, increasing the potential magnitude of this problem. To this end, one might think twice before deciding to use fewer markers in association studies in isolates than elsewhere. 
An Excel spreadsheet in which three-locus haplotypes can be input under general penetrance and haplotype frequency models, to examine these detectance distributions and compare them to the predictions of the Fundamental Theorem of the HapMap, is available from the authors at http://linkage.cpmc.columbia.edu/excel/r-squared.xls.

Software, SIMQTL, for analysis of more complex models under more sophisticated ascertainment schemes is also available from the authors. It should be kept in mind that while such a multiplicity of risk alleles substantially decreases the power of association tests, it generally tends to increase the power in linkage studies. The identities of specific alleles are not examined in linkage analysis, only the sharing of any alleles (whatever their molecular configuration) IBD among relatives in a pedigree, so that whenever functional variants are linked to one another, the power of a linkage study will increase substantially, even when those loci are as far apart as several Mb!

Discussion

The proponents of association-based mapping strategies argue that since A has no functional effect on C, any correlation between A and C must be because A is correlated with B and B is correlated with C, justifying the assumption that P(A|BC)=P(A|Bc)=P(A|B). Simple algebraic manipulations show that this condition is equivalent to saying that P(C|AB)=P(C|aB). At first glance, this seems to be a reasonable assumption, namely that if A is nonfunctional, then the probability of any given phenotypic outcome in general is independent of A.53, 54 However, there are other ways in which having haplotype AB can influence the risk of a phenotypic outcome differently from haplotype aB. We argue that it is the exception, not the rule, for such conditional independence to hold in genetic studies of complex traits, and that the assumptions of Gabriel et al53 rarely hold in practice. Certainly, blanket statements about the relationships between r2 and sample size requirements for association studies are not factual, since typically ρAC≠ρABρBC. In fact, equality of these terms only holds in the best-case scenarios, analogous to linkage analysis. Conditional independence in genetics is rarely an appropriate assumption (as people have been learning the hard way in attempts to look at linkage analysis with massively dense sets of SNP markers4, 37, 48, 61, 62, 63, 64, 65, 66), and this is the primary reason why geneticists avoid using classical statistical techniques in favor of complex likelihood-based models when making inferences. It is imperative to remember that statistical independence is a very different thing from causal independence, and is a very strong assumption, which can have enormous consequences.

Above, we have provided a simple example where conditional independence does not hold owing to allelic heterogeneity. While allelic heterogeneity is one potential reason for deviation from the theory, it is certainly not the only one. Ethnic heterogeneity, in which the frequencies of both phenotype and SNP marker alleles vary among subpopulations, can create not only false positives, as is by now well-appreciated, but can just as easily create ‘false negatives’: that is to say, tag SNPs may show no correlation at all with the phenotype, even when an easily detectable functional variant has r2 of 0.8 with the tag SNP, for similar mathematical reasons. Likewise, environmental risk factors can introduce similar confounding that easily cancels out the effect of a functional variant when a tag SNP is typed in its place. And here we are only considering cases in which the functional variant does have power to be detected if it were genotyped and measured. The fact is that correlation coefficients are almost never multiplicative in practice, and in studies involving genetic risk factors for disease, we have known for decades that conditional independence of exposures never holds. In fact, this is the entire basis for the development of the complex likelihood methods we have relied upon for the past decades in understanding the genetic basis of simple diseases.
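The ‘false negative’ mechanism can be illustrated with a deliberately simplified sketch. It considers only a binary marker and a binary phenotype in two hypothetical subpopulations, with made-up joint frequencies chosen so that the marker is positively associated with the phenotype within each subpopulation while the allele frequency and the prevalence differ in opposite directions between them. The tables and the equal mixing weight are assumptions of the example.

```python
def r2_from_table(t):
    """Squared correlation from a 2x2 pmf t[(a, c)] over marker a and phenotype c."""
    pa = t[(1, 0)] + t[(1, 1)]
    pc = t[(0, 1)] + t[(1, 1)]
    cov = t[(1, 1)] - pa * pc
    return cov ** 2 / (pa * (1 - pa) * pc * (1 - pc))

def mix(t1, t2, w1=0.5):
    """Joint pmf of a sample drawn from t1 with probability w1, else from t2."""
    return {k: w1 * t1[k] + (1 - w1) * t2[k] for k in t1}

# Hypothetical subpopulation 1: marker allele common (0.6), phenotype rare (0.1).
pop1 = {(1, 1): 0.08, (1, 0): 0.52, (0, 1): 0.02, (0, 0): 0.38}
# Hypothetical subpopulation 2: marker allele rare (0.2), phenotype common (0.3).
pop2 = {(1, 1): 0.08, (1, 0): 0.12, (0, 1): 0.22, (0, 0): 0.58}

pooled = mix(pop1, pop2)

print("r2 within subpopulation 1:", r2_from_table(pop1))
print("r2 within subpopulation 2:", r2_from_table(pop2))
print("r2 in the pooled sample:  ", r2_from_table(pooled))
```

With these particular numbers the within-group covariance is cancelled by the between-group covariance, so the pooled sample shows essentially no marker-phenotype correlation even though each subpopulation does.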

Sequencing samples consisting of cases and controls in candidate regions in the genome to estimate the amount of LD measured by r2 among the SNPs they identify, in order to select ‘tag’ SNPs for further study, is a strange approach, because, as we showed above, r2 must vary between cases and controls whenever one of the markers has a functional effect. Often people look at the r2 in cases and controls, and if it is not different, they pool the data and use that to select a marker, but as shown above, this will always bias the estimates. For that matter, if the r2 really is invariant in cases and controls, this is evidence against alleles of either SNP being functional, which is potentially a reason not to type either of them. Again, it is important to be careful about the assumptions, the theory, and their ramifications, rather than proceeding naively based on arguments like ‘but that is what everyone else is doing, so it must be right’, which we have all been subjected to.
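As a rough illustration of why r2 differs between cases and controls when one SNP is functional, the following sketch uses a haploid toy model in which risk depends only on the allele at B. The haplotype frequencies, the penetrances, and the 50:50 pooling of cases and controls are all hypothetical choices made for the example.

```python
def r2_ab(h):
    """r2 between loci A and B from haplotype frequencies h[(a, b)]."""
    pa = h[(1, 0)] + h[(1, 1)]
    pb = h[(0, 1)] + h[(1, 1)]
    d = h[(1, 1)] - pa * pb
    return d ** 2 / (pa * (1 - pa) * pb * (1 - pb))

def condition(h, weight_by_b):
    """Reweight haplotypes by a factor that depends only on the B allele, then renormalize."""
    w = {k: p * weight_by_b[k[1]] for k, p in h.items()}
    total = sum(w.values())
    return {k: v / total for k, v in w.items()}

# Hypothetical population haplotype frequencies, keyed by (a, b); 1 = allele A / allele B.
population = {(0, 0): 0.63, (1, 0): 0.07, (0, 1): 0.06, (1, 1): 0.24}
# Hypothetical penetrances P(affected | b) and P(affected | B); A has no effect.
penetrance = {0: 0.05, 1: 0.40}

cases = condition(population, penetrance)
controls = condition(population, {b: 1 - f for b, f in penetrance.items()})
# Equal numbers of cases and controls pooled into one sample.
pooled = {k: 0.5 * cases[k] + 0.5 * controls[k] for k in population}

for name, h in (("population", population), ("cases", cases),
                ("controls", controls), ("pooled 50:50", pooled)):
    print(f"r2(A,B) in {name}: {r2_ab(h):.4f}")
```

In this toy setting the four r2 values are all different, so neither the case sample, the control sample, nor their pooled combination reproduces the population value.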

We hope to make readers think twice before engaging in high-risk studies without fully evaluating the potential for confounding factors such as those described here to complicate the theoretical predictions. Human geneticists routinely rely on mathematical theory and predictions without fully understanding the assumptions driving the theory, or contemplating their implications. If nothing else, it is hoped that statistical geneticists would be more forthcoming and explicit about the theoretical ramifications of the model assumptions, as this is but one example where they fall apart. Similar difficulties and inconsistencies between theory and practice can be widely seen in such areas as studies of gene–gene and gene–environment interaction, where independence of exposures is assumed, and deviations from independence of exposures conditional on phenotype are inferred to imply etiological interactions, when nonindependence of exposures and ascertainment bias are equally capable of explaining such phenomena without the need to invoke complex etiological interactions. Another obvious example of inconsistent theory and practice arises when linkage analyses of extremely concordant and discordant sibling pairs are performed assuming some component of variance due to polygenic factors, and yet the null hypothesis in the linkage analysis predicts that at random genomic locations, 50% of the genome of such sibs should be IBD (when of course the polygenic factors, which are individually too weak to detect, must alter this average genome-wide sharing if the analysis shows they exist).

We hope that as gene-hunting approaches increase in cost and size, we will become much more careful about what we believe, rather than more cavalier about theoretical assumptions. Technological advances are wonderful, and make it possible to do science that we could not imagine a few decades ago, but applying excellent technology to poorly designed studies (driven by assumptions the investigators themselves probably would not really believe if consciously aware of them) is not a particularly wise way to do science. It would be far better to spend more time thinking and planning before jumping in to genotyping every sample we can get our hands on, lest no one listen to us when we cry fire and there actually is one, at some point in the future.