For most common diseases, including heart disease, diabetes, hypertension, and cancer, multiple genetic and environmental factors influence an individual's risk of being affected. This complexity contrasts with the inheritance pattern of monogenic disorders, in which the presence or absence of disease alleles usually completely predicts the presence or absence of disease (although the severity or age of onset may vary). For genetically complex diseases, risk alleles are less deterministic and more probabilistic: the presence of a high-risk allele may only mildly increase the chance of disease. Furthermore, it has been proposed that these weakly penetrant alleles may be present at high frequency (>1%) in the population.1–3

The widespread presence of high frequency variants in humans was first shown experimentally by Harris among others,4 who found that many proteins have several common, heritable isoforms, thereby demonstrating that common genetic variation could lead to variation in protein structure. The widespread presence of such variation suggested that common variants might be biologically important. As Harris4 hypothesized in 1971 (see p. 272), “The other group of alleles, though numerically much fewer, are individually much more common. They [common DNA variants] provide the basis for the great variety of enzyme … polymorphisms which evidently occur. These are quite possibly the underlying biochemical cause of much of the inherited diversity in the physical and physiological characteristics of individuals, and also in relative susceptibilities to various diseases and other disorders.” Unfortunately, tests of this hypothesis were limited to proteins for which common functional variation could be easily assayed (primarily a few enzymes and determinants of blood group antigens).

The advent of gene cloning and sequencing substantially lowered this technical hurdle. It became possible to easily detect DNA variants in a given gene. The first genetic variants tested were usually restriction fragment length polymorphisms (RFLPs), but with the development of the polymerase chain reaction (PCR) and other improvements in technology, microsatellites, variable number tandem repeats (VNTRs), insertion/deletion polymorphisms, and single nucleotide polymorphisms (SNPs) could all be analyzed.

By determining the genotype of these variants in individuals with disease and in unaffected controls, these polymorphisms could be tested for association with susceptibility to a variety of diseases. Such studies, called “association studies,” have usually used a case-control design (although family-based designs have also been used; see below). In this design, the frequencies of the alleles or genotypes at the site of interest are compared in populations of cases and controls; a higher frequency in cases is taken as evidence that the allele or genotype is associated with increased risk of disease. The usual conclusion of such studies is that the polymorphism being tested either affects risk of disease directly or is a marker for some nearby genetic variant that affects risk of disease.
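The case-control comparison described above can be sketched as a test on a 2×2 table of allele counts. This is a minimal illustration and not the analysis used in any particular study: the function name `allele_chi_square` and all counts are hypothetical, and a plain Pearson chi-square test with 1 degree of freedom (no continuity correction) is assumed.

```python
import math


def allele_chi_square(case_risk, case_other, control_risk, control_other):
    """Pearson chi-square test (1 df) on a 2x2 table of allele counts.

    Returns the chi-square statistic, its P value, and the allelic odds ratio.
    """
    a, b, c, d = case_risk, case_other, control_risk, control_other
    n = a + b + c + d
    # Shortcut formula for a 2x2 table, without continuity correction.
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of the chi-square distribution with 1 df:
    # P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    odds_ratio = (a * d) / (b * c)
    return chi2, p_value, odds_ratio


# Hypothetical data: 200 cases and 200 controls, two alleles per person.
chi2, p_value, odds_ratio = allele_chi_square(120, 280, 80, 320)
print(f"chi2 = {chi2:.2f}, P = {p_value:.4f}, OR = {odds_ratio:.2f}")
```

A significant result from such a test shows only that allele frequencies differ between the two samples; it does not by itself distinguish a causal variant from a nearby marker or from an artifact of sampling.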

These association studies were further facilitated by the increasingly rapid discovery of common polymorphisms in genes, accomplished by resequencing the same stretch of DNA in multiple individuals. One of the goals of the human genome project has been to identify large numbers of SNPs; indeed, the number of SNPs in public databases is now well over 1,000,000.5 As we describe below, association studies have already identified over 600 potential associations between common genetic variants and susceptibility to common disease. As the availability of known polymorphisms skyrockets, so too will the number of reported associations. It is, therefore, critical to have a framework in place by which one can evaluate and interpret these associations.

The purpose of this publication is to list and put into perspective many of the examples of associations in the recent literature, thereby providing an interim picture of this exciting and rapidly developing field. In addition, we will examine in detail two illustrative examples: (1) the association between deep venous thrombosis and factor V Leiden, a common polymorphism in the gene encoding clotting factor V, and (2) the association between various diseases and a common polymorphism in MTHFR, the gene encoding methylene tetrahydrofolate reductase. Finally, we will suggest some guidelines for the analysis of association studies, because proper evaluation of these associations is critical both to understanding the genetics of common disease and to informing recent discussions regarding screening for common genetic disease.


We performed two independent reviews of the literature from 1986 through 2000 to identify published significant associations between common diseases or dichotomous traits and common polymorphisms in or near genes (sites of genetic variation in which the minor allele frequency is at least 1%). We excluded monogenic disorders, because linkage analysis and positional cloning methods have been highly successful in identifying the alleles responsible for these diseases. Because of the large amount of prior literature, we also did not consider polymorphisms in HLA or blood group antigens, even though there are many robust associations between variation at these loci and disease. For simplicity, we have only included associations between variation at a single locus and susceptibility to disease in the entire population under study in the publication. In particular, we have not included associations between pairs of loci and susceptibility to disease nor associations between a polymorphism and susceptibility to disease in a subgroup of patients (such as smokers or those receiving hormone replacement therapy). Thereby we have explicitly ignored reports of gene-gene and gene-environment interactions, even though some of these interactions may well be of great biologic and clinical interest. Finally, we have not listed associations with substance abuse (where phenotype definition is often murky), associations between polymorphisms and variation in laboratory findings (such as serum calcium levels), or associations with other quantitative, continuous traits (as opposed to dichotomous traits). Associations were considered significant if the nominal P value was < 0.05 or if the 95% confidence intervals for relative risk excluded 1.00.


We identified 268 genes that contain polymorphisms reported to be associated with 1 of 133 common diseases or dichotomous traits. In total, these 268 genes accounted for 603 different gene-disease associations. These associations are listed in Table 1, grouped according to the trait or disease under study. As seen in Figure 1, the number of new genes associated with diseases or traits has risen more or less steadily from 1993 to 2000. The temporary drop-off in 1999 and early 2000 likely reflects an emphasis on testing newly identified polymorphisms in previously studied genes (data not shown). Examination of Table 1 also shows that many genes have been associated with several different diseases; for example, polymorphisms in TNF, the gene encoding tumor necrosis factor alpha, have been associated with 20 different diseases or traits, whereas variants in ACE (encoding angiotensin converting enzyme), VDR (encoding the vitamin D receptor), and MTHFR (encoding methylene tetrahydrofolate reductase) have each been associated with over a dozen different diseases or traits (see also supplementary Table 1). As illustrative examples, we examine in more detail two of the associations in Table 1: the association of F5 (clotting factor V) and deep venous thrombosis, and the association between MTHFR and a variety of diseases.

Table 1 Associations between common polymorphisms in genes and common diseases or dichotomous traits
Figure 1

The number of new, previously unreported, significant associations between diseases or dichotomous traits and genes is plotted for each year from 1984 through 2000. The graph does not include new associations between a disease or trait and polymorphisms in a gene for which other polymorphisms had previously been significantly associated with that disease or trait.

The original report of an association between F5 and deep venous thrombosis grew out of observations that resistance to activated protein C, a biochemically defined phenotype, was associated with markedly increased risk of deep venous thrombosis.6 In an elegant study, the molecular basis of activated protein C resistance was shown to be a single nucleotide polymorphism in F5 encoding an arginine to glutamine change in codon 506 (Factor V Leiden; see Bertina et al.7). This change occurs at one of the protein C cleavage sites, thereby preventing inactivation of factor V by activated protein C and leading to a hypercoagulable state.8 Subsequent studies of this polymorphism have repeatedly demonstrated association with susceptibility to deep venous thrombosis, with P values often at or below 10⁻⁴ in individual studies (for example, Salomon et al.9). These studies were performed in several different populations, although the range of populations available for study is limited by the fact that Factor V Leiden is uncommon in non-Caucasian populations.10 Thus, this association is extremely robust in addition to having high biologic plausibility.

By contrast, associations involving common variation in MTHFR have not been as reproducible. A common thermolabile variant of methylene tetrahydrofolate reductase was first described in 1991. Thermolability of enzyme activity is inherited as a recessive trait11 and was eventually shown to be due to homozygosity for the “T” allele at a C/T polymorphism in nucleotide 677 (causing an alanine to valine change, see Frosst et al.12). Unlike the rare, more severe mutations in MTHFR which cause homocystinuria, the variant was not associated with neurologic deficits. However, thermolability of enzyme activity was observed to be associated with altered homocysteine levels and risk of coronary artery disease,11 findings that were confirmed in at least one subsequent study that looked at nucleotide 677 (see Gallagher et al. and Kluijtmans et al.13,14). Folate metabolism and homocysteine levels are connected with several clinical disorders, including coronary artery disease, deep venous thrombosis, neural tube defects, and cancer (see Gailey and Gregory15 for review); the thermolabile variant has been associated in different studies with increased risk of each of these diseases.13,14,16–20 However, despite the biologic plausibility of these associations, none have been reproducibly observed across many studies (for example, Ma et al.21–23).

If all of the associations listed in Table 1 could be replicated as consistently as factor V Leiden and deep venous thrombosis, this list would represent a significant understanding of the etiologies of most of the major human diseases. However, genetic associations more often behave like those seen with MTHFR: they are not consistently reproducible. To determine what fraction of the associations in Table 1 were robust, we first identified those associations for which an assessment of reproducibility could be made. These 166 associations (those for which we could find and review at least three separate publications) are listed in Table 2. Where more than one polymorphism in a gene was studied, the polymorphisms were treated separately. Although a significant effort was made to be complete, there are undoubtedly some well-studied associations that are not listed in Table 2. Nevertheless, we believe that this list is a reasonably accurate representation of the state of published association studies between polymorphisms and common genetic disease.

Table 2 Disease-polymorphism associations for which at least three studies were identified

We reviewed the 166 associations in Table 2 to determine whether other studies of the same polymorphism and disease also reached statistical significance. Only six associations were reproduced at a high level of consistency (statistical significance was achieved in 75% or more of all identified studies). These six associations are listed in Table 3. The possibility of publication bias and consequent omission of “negative” studies means that six is actually an upper limit for the number of consistently reproducible associations. Of the associations in Table 3, the most reproducible was the association of ApoE4 and Alzheimer's disease, for which dozens of reports reach statistical significance. It should be noted, however, that the association is most robust in Caucasians (all identified reports achieved statistical significance); for other ethnic groups (Africans, African-Americans, and Hispanics), the association is sometimes more difficult to demonstrate.24–26

Table 3 Highly consistently reproducible associations (≥75% positive studies)

What could be the cause of the irreproducibility that characterizes the vast majority of association studies? One possibility is that the original observations represent statistical fluctuations (type I error). If this were the case, one would predict that only 5% of subsequent studies would also reach statistical significance with P < 0.05, and most associations would never be observed again. However, of the 166 associations listed in Table 2, at least 97 were observed again, many of them multiple times. Thus in the absence of a massive publication bias (selective publication of positive results with numerous negative studies remaining unpublished), statistical fluctuation is unlikely to explain all of the initial positive reports in Table 2.
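The type I error argument can be made concrete with a short binomial calculation. The sketch below assumes that follow-up studies are independent and that each has a 5% chance of reaching P < 0.05 when there is no true effect; the function name `prob_at_least` and the numbers of follow-up studies are hypothetical and are not taken from Table 2.

```python
from math import comb


def prob_at_least(k, n, alpha=0.05):
    """Probability that at least k of n independent studies of a null
    association reach significance at level alpha."""
    return sum(
        comb(n, i) * alpha**i * (1 - alpha) ** (n - i) for i in range(k, n + 1)
    )


# Hypothetical numbers of follow-up studies for one association with no true effect.
for n_followups in (2, 5, 10):
    once = prob_at_least(1, n_followups)
    twice = prob_at_least(2, n_followups)
    print(f"{n_followups} follow-ups: P(>=1 positive) = {once:.3f}, "
          f"P(>=2 positive) = {twice:.3f}")
```

Under these assumptions a single chance replication is not rare when many follow-up studies exist, but repeated replication quickly becomes improbable, which is why multiple independent positive reports are hard to attribute to statistical fluctuation alone.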

Other possible causes of false-positive association studies have been previously identified and include ethnic admixture resulting in population stratification, variable linkage disequilibrium between the polymorphism being studied and the true causal variant, and population-specific gene-gene or gene-environment interactions.27–30 Each of these issues is addressed briefly in turn below, and possible remedies are offered. Finally, we examine the possibility that weak genetic effects combined with underpowered studies lead to significant numbers of falsely negative reports.


Most association studies have a case-control study design, in which allele or genotype frequencies in patients are compared with frequencies in an unaffected control population (Fig. 2 a). This study design is subject to population stratification due to ethnic admixture, which occurs when the cases and controls are unintentionally drawn from two or more ethnic groups or subgroups. If one of these subgroups has a higher disease prevalence than the others, stratification occurs, because that subgroup will be overrepresented in the cases and underrepresented in the controls. Any polymorphism that genetically marks the high-risk subgroup (i.e., is found by chance at a higher frequency in that subgroup), therefore, will appear to be associated with disease (Fig. 2 b) and will likely be a false positive. Interestingly, the frequencies of several of the alleles in Table 2 vary substantially between populations, consistent with the possibility of false associations due to ethnic admixture. It should be noted that well-defined subgroups are not necessary to observe stratification; stratification can also occur in a single admixed population where the individuals have varying degrees of genetic contributions from two or more ethnic groups. Even apparently homogeneous, isolated populations (such as Iceland) are in theory susceptible to admixture if there have been multiple distinct waves of migration from different source populations (e.g., Celtic and Norse, in the case of Iceland).

Figure 2

True associations contrasted with false-positive associations due to ethnic admixture. The open shapes represent individuals with disease, and the filled shapes represent individuals from a control population. Shapes with a plus sign (+) represent individuals carrying the putative risk allele being tested for association. In both figures, the fraction of individuals carrying the risk allele is twice as large in the case population as in the control population. Figure 2 a (Top): True-positive association: the frequency of the risk allele is greater in cases than in controls in both ethnic groups. Figure 2 b (Bottom): False-positive association due to ethnic admixture: the frequency of the risk allele is identical in cases and controls in both populations. However, the allele is twice as frequent overall in cases as in controls. This false appearance of association is due to ethnic admixture, i.e., ethnic group 1 is overrepresented in the cases, and the allele being tested is prevalent in ethnic group 1 but not ethnic group 2.
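The arithmetic behind a false-positive association of the kind shown in Figure 2 b can be checked with a small sketch. All counts are hypothetical and are not read from the figure. The Mantel-Haenszel estimator used at the end is a standard stratified method introduced here for illustration only; it is not one of the remedies discussed in this review, and it requires that group membership be known.

```python
def odds_ratio(case_carriers, case_non, control_carriers, control_non):
    """Odds ratio for carrying the allele, cases versus controls."""
    return (case_carriers * control_non) / (case_non * control_carriers)


# Hypothetical strata, each given as
# (case carriers, case non-carriers, control carriers, control non-carriers).
strata = {
    "group 1": (40, 40, 10, 10),  # carrier frequency 0.5 in cases and in controls
    "group 2": (2, 18, 8, 72),    # carrier frequency 0.1 in cases and in controls
}

# Within each group the allele is unrelated to disease.
stratum_ors = {name: odds_ratio(*table) for name, table in strata.items()}

# Pooling the groups ignores that group 1 supplies most cases and few controls.
pooled_table = tuple(sum(column) for column in zip(*strata.values()))
pooled_or = odds_ratio(*pooled_table)

# Mantel-Haenszel odds ratio: combines the strata without pooling raw counts.
mh_numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata.values())
mh_denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata.values())
mh_or = mh_numerator / mh_denominator

print("within-group odds ratios:", stratum_ors)
print(f"pooled odds ratio: {pooled_or:.2f}")
print(f"Mantel-Haenszel odds ratio: {mh_or:.2f}")
```

In this example the pooled odds ratio is well above 1 even though the odds ratio within each group is exactly 1, so the apparent association arises entirely from the unequal mix of groups among cases and controls.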

What steps can be taken to prevent false-positive associations due to population stratification? Currently, several solutions can be attempted. First, one can use family-based studies such as the transmission disequilibrium test.31 This method, abbreviated TDT, requires affected offspring and their parents to test an allele for association with disease; the frequency with which heterozygous parents transmit that allele to offspring is then determined. This frequency is compared with the Mendelian expectation of 50:50 transmission of the allele. TDT (like other family-based methods) is immune to false positives from ethnic admixture.31 Disadvantages of the TDT are that family-based samples are often difficult to collect and that 50% more genotyping is required than in case-control studies to achieve similar power (the exact loss of power depends on the underlying genetic model). Another possibility is to study multiple case-control populations, each from different ethnic groups, and require that an association be seen in each population. Finally, an approach to detect and correct for stratification has been proposed: by typing several dozen random markers, one can empirically determine the degree of stratification in a case-control study.32–34 If significant stratification is detected, one can use these markers to more carefully match cases and controls to remove the effects of stratification.35 There is some debate as to whether stratification is a significant problem; some authors believe that even minimal ethnic matching of cases and controls is adequate to prevent stratification.36 However, there are as yet no empirical data that address the degree of stratification found in a typical association study.
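The comparison of observed transmissions with the 50:50 Mendelian expectation can be sketched as follows. The function name `tdt` and the transmission counts are hypothetical; the statistic shown is the usual McNemar-type form of the TDT, which is assumed here to follow a chi-square distribution with 1 degree of freedom under the null hypothesis.

```python
import math


def tdt(transmitted, untransmitted):
    """Transmission disequilibrium test for one allele.

    transmitted   -- times heterozygous parents passed the allele to an
                     affected child
    untransmitted -- times heterozygous parents passed on the other allele
    """
    b, c = transmitted, untransmitted
    # Under Mendelian 50:50 transmission, b and c are expected to be equal.
    chi2 = (b - c) ** 2 / (b + c)
    # Upper tail of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts from 124 heterozygous parents of affected offspring.
tdt_chi2, tdt_p = tdt(transmitted=78, untransmitted=46)
print(f"TDT chi2 = {tdt_chi2:.2f}, P = {tdt_p:.4f}")
```

Because only transmissions within families are counted, the comparison does not depend on allele frequencies in any external control group, which is the property that protects the test against stratification.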


Failure of replication can also occur if the polymorphism being tested is not itself the causal variant but is rather in linkage disequilibrium with the causal variant. Linkage disequilibrium, in which nearby variants are correlated with each other more often than expected by chance, depends heavily on population history and on the genetic make-up of the founders of that population. If all examples of a particular stretch of DNA in a population derive from a recent common ancestor, there will have been few opportunities for recombination events to separate variants within that stretch of DNA and the variants will often be inherited together throughout the population. If, in a different population, the time since a common ancestor is longer, more recombination events will have occurred, disrupting linkage disequilibrium in the region. Furthermore, the particular arrangement of variants in the founders of a population will determine which variants are inherited together. Thus, it is possible that a polymorphism will be in linkage disequilibrium with a nearby disease allele in one population but not in another, leading to variable results of association studies. For example, many of the associations with TNF in Table 1 might reflect associations with nearby HLA loci (HLA is a region with strong linkage disequilibrium over large distances). To explore this possibility, positive associations should be followed up by testing adjacent markers (both individually and as multi-marker haplotypes). If linkage disequilibrium is present (and particularly if any of the haplotypes or adjacent markers show stronger association), the possibility exists that the original marker tested is not the causal allele, and further studies of the region are warranted. Although it should be possible to exhaustively test modest sized regions of linkage disequilibrium, special circumstances (e.g., recently admixed populations) may in theory give rise to correlation between markers at much greater distances.


Another potential source of variable findings is gene-gene or gene-environment interactions that differ between populations. For example, if the effect of a variant were only manifest in populations with a particular genetic or environmental background, then association would only be seen in populations or subgroups with the appropriate genetic or environmental characteristics. This explanation is commonly invoked to explain differing results of association studies but is less frequently supported by direct evidence. A further problem arises when considering gene-gene or gene-environment interactions: when combinations of alleles and/or environmental factors are studied, P values are rarely corrected for the number of tests reported (much less the number of tests actually performed). Such “nominally” significant results must be considered to be the product of hypothesis generation rather than hypothesis testing and, therefore, require replication. Perhaps the best possible method of demonstrating that a gene-environment interaction is likely to be correct (and not a statistical fluctuation expected when exploring numerous hypotheses) is to divide the study population randomly into two parts and require that any findings be observed in both parts of the study. Sample sizes need to be increased slightly to maintain power, but the ability to generate and then test hypotheses in the same sample would seem to outweigh this consideration. Otherwise, one requires a replication population that is exactly matched for environmental and genetic background, an extremely unlikely scenario.


Finally, associations can be real but nonetheless not reproducible if the underlying genetic effect is weak. If the subsequent studies are small in size, they will be underpowered to reliably detect weak effects and, therefore, fail to achieve statistical significance. This difficulty is heightened by the “jackpot” effect, in which the first group to publish a significant association involving a weak locus is more likely to have overestimated than underestimated the true effect of the polymorphism. This phenomenon occurs because each study imprecisely estimates the strength of the effect (due to sampling variation). Because a weak effect would in most cases not provide a statistically significant finding in a typically sized study (a few hundred cases and controls), the first published study that does manage to achieve statistical significance is almost certain to have overestimated the true effect of the variant being tested. Subsequent studies thus need to include much larger numbers of patients to achieve statistical significance. In particular, failure to observe the magnitude of effect seen in the first study should not be taken as a repudiation of the association. We observed this phenomenon for the association of type 2 diabetes and a Pro12Ala polymorphism in the PPARG gene, where an initial study estimated the effect on diabetes risk to be threefold,37 but subsequent studies observed very modest risks that usually did not achieve statistical significance.38–42 We tested the variant in several large populations and found that the effect on diabetes risk was modest (1.25-fold) but significant (P = 0.002 in our data alone29). Indeed, all of the previous studies, both positive and negative, were consistent with this 1.25-fold effect, and two subsequent large studies confirmed this association.43,44 Because many alleles may have similarly weak genetic effects, large studies and/or meta-analyses of multiple studies will often be required to determine whether genetic associations between polymorphisms and disease are significant.
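The dependence of power on effect size and sample size can be illustrated with a rough calculation. The sketch below uses a normal approximation for comparing allele frequencies between cases and controls and ignores the negligible opposite tail. The control allele frequency of 0.2, the sample sizes, and the function names are hypothetical assumptions and are not taken from the PPARG studies; only the odds ratios of 1.25 and 3 echo the effect sizes mentioned above.

```python
from statistics import NormalDist


def case_frequency(control_freq, odds_ratio):
    """Allele frequency in cases implied by a control frequency and an odds ratio."""
    odds = odds_ratio * control_freq / (1 - control_freq)
    return odds / (1 + odds)


def power_two_proportions(p0, p1, alleles_per_group, alpha=0.05):
    """Approximate power of a two-sided test comparing two allele frequencies."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    p_bar = (p0 + p1) / 2
    # Standard error under the null (common frequency) and under the alternative.
    se_null = (2 * p_bar * (1 - p_bar) / alleles_per_group) ** 0.5
    se_alt = ((p0 * (1 - p0) + p1 * (1 - p1)) / alleles_per_group) ** 0.5
    return z.cdf((abs(p1 - p0) - z_alpha * se_null) / se_alt)


control_freq = 0.2        # hypothetical risk-allele frequency in controls
small_study = 2 * 300     # alleles per group: 300 cases, 300 controls
large_study = 2 * 3000    # alleles per group: 3,000 cases, 3,000 controls

power_weak = power_two_proportions(
    control_freq, case_frequency(control_freq, 1.25), small_study)
power_strong = power_two_proportions(
    control_freq, case_frequency(control_freq, 3.0), small_study)
power_weak_large = power_two_proportions(
    control_freq, case_frequency(control_freq, 1.25), large_study)

print(f"OR 1.25, 300 per group:  power = {power_weak:.2f}")
print(f"OR 3.00, 300 per group:  power = {power_strong:.2f}")
print(f"OR 1.25, 3000 per group: power = {power_weak_large:.2f}")
```

Under these assumptions a study of a few hundred cases and controls detects a 1.25-fold effect less than half the time, so a series of such studies would be expected to yield a mixture of positive and negative reports even when the association is real.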


How does one tell whether reported associations between polymorphisms and disease are real? Reasonable criteria for declaring association have been proposed, including low P values, replication in multiple samples, and avoidance of population stratification (such as by using family-based controls28). However, most studies do not meet these criteria, and multiple studies of an association are usually inconsistent. In these cases, meta-analysis of all published studies may guide interpretation, and we strongly advocate that any publication of an association study (whether negative or positive) be accompanied by a meta-analysis of all similar studies. Accordingly, individual researchers should also publish or make easily available sufficient information to facilitate future meta-analysis, including relevant genotype and phenotype data. Publication bias may present a major challenge to such analyses, because the omission of small negative studies will bias the pooled data toward a positive result. In this regard, we advocate a mechanism for storage and dissemination of all association data (published or not), perhaps in a widely accepted and curated Web site and/or in brief “negative results” sections of specialty journals. Until complete meta-analyses can be performed using data from multiple large studies, we will be left with a scenario in which the majority of reported associations are in genetic purgatory, neither convincingly confirmed nor refuted, awaiting future judgment.
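One simple way to pool published studies is a fixed-effect, inverse-variance combination of log odds ratios. The sketch below is offered only as an illustration of the idea and is not the procedure used for any analysis in this review; the function name `pooled_odds_ratio` and the three sets of allele counts are hypothetical, and no adjustment is made for heterogeneity between studies or for publication bias.

```python
import math


def pooled_odds_ratio(tables):
    """Fixed-effect inverse-variance pooling of log odds ratios.

    Each table is (case risk alleles, case other alleles,
                   control risk alleles, control other alleles).
    Returns the pooled odds ratio, its 95% confidence interval, and a
    two-sided P value.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for a, b, c, d in tables:
        log_or = math.log((a * d) / (b * c))
        variance = 1 / a + 1 / b + 1 / c + 1 / d  # Woolf variance of log OR
        weight = 1 / variance
        weighted_sum += weight * log_or
        total_weight += weight
    pooled_log = weighted_sum / total_weight
    se = math.sqrt(1 / total_weight)
    low = math.exp(pooled_log - 1.96 * se)
    high = math.exp(pooled_log + 1.96 * se)
    # Two-sided P value from the normal distribution.
    p_value = math.erfc(abs(pooled_log) / se / math.sqrt(2))
    return math.exp(pooled_log), (low, high), p_value


# Three hypothetical studies of the same polymorphism and disease.
studies = [
    (110, 290, 93, 307),
    (230, 570, 195, 605),
    (55, 145, 48, 152),
]
pooled_or, (ci_low, ci_high), pooled_p = pooled_odds_ratio(studies)
print(f"pooled OR = {pooled_or:.2f} "
      f"(95% CI {ci_low:.2f} to {ci_high:.2f}), P = {pooled_p:.3f}")
```

Larger studies receive more weight because their log odds ratios have smaller variance, so several individually inconclusive studies can together give a confidence interval that excludes 1.00. The same weighting also means that leaving out small negative studies shifts the pooled estimate, which is the publication bias concern raised above.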

Much of the interest surrounding genetic association studies centers on the potential clinical application of polymorphisms that serve as markers for disease. In particular, it has been proposed that these markers can serve both as predictors of disease and as a means to tailor treatment of disease. Although this scenario may well become reality, the current irreproducibility of most studies should raise a loud cautionary alarm. Certainly, clinical applications of genetic associations should not be considered until the degree of certainty far exceeds the level currently achieved for the vast majority of such associations. Furthermore, even if an association is supported by extremely convincing evidence, screening patients is only appropriate if determining an individual's genotype would allow a clinically proven beneficial intervention that outweighs the risk of performing the test. Genetic tests also give rise to ethical considerations, because of the implication for family members, the potential for discrimination, the immutability of genetic risk factors, and the predictive nature of such tests. (Given the probable modest effects of any particular genetic variant, however, most genetic tests are likely to be much less predictive of future health than widely used screens such as blood pressure and cholesterol measurements.) Societal consensus and legislative solutions addressing these ethical concerns are needed before such testing enters widespread clinical practice.

Because of the scientific and ethical uncertainties, a “DNA chip” that can determine crucial genotypes and accurately predict future health is unlikely to become a widespread and useful screening tool in the near future, even if concerns regarding reproducibility can be resolved. Rather, the most likely short-term benefit from genetic association studies will be a better understanding of disease pathogenesis, which will hopefully lead in turn to novel and better treatments and/or more tailored drug therapy. If genetic association studies can provide these sorts of advances, they will have proven a valuable resource in the struggle to understand and treat common disease.

Table 4 Gene symbols with OMIM numbers and aliases/descriptions