GENETIC BASIS OF COMPLEX TRAITS AND COMMON DISEASE

Many common diseases cluster in families in patterns that demonstrate that genetics plays a role in determining susceptibility. For example, the identical twin of a patient with type 1 diabetes will also get type 1 diabetes 30–50% of the time; dizygotic twins (who share a common environment but only 50% of their genes) are much less concordant (1,2). A sibling of a patient with type 1 diabetes is 15 times more likely to get diabetes than an unrelated individual (3), also suggesting a strong genetic component to disease susceptibility. In the case of schizophrenia, Risch (4) further demonstrated that the risk of disease falls off rapidly for relatives of schizophrenic individuals with decreasing genetic relatedness, consistent with a model where variation in multiple genes combines to influence disease risk. The increased risk to relatives (λ) is one measure of the influence of genetics; another measure of the contribution of inherited factors is termed heritability (h²), which signifies the fraction of the population variation that can be explained by genetic factors working together in an additive fashion. Heritability can be estimated either from family studies or twin studies, and is often between 30 and 50% for common diseases such as diabetes (1,2,5), or quantitative traits such as body mass index or blood pressure (6,7). Thus, multiple genetic factors and nongenetic factors combine to influence the risk of common diseases and quantitative traits. Because multiple genetic and nongenetic factors interact to affect phenotype, these diseases and traits are termed complex genetic traits.
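As a rough illustration of the two measures just described, the sketch below computes a relative risk ratio (λ) and a twin-based heritability estimate. The heritability function uses Falconer's textbook approximation, twice the difference between monozygotic and dizygotic twin correlations, which is one common estimator and not necessarily the method used in the cited studies. The function names and all numeric inputs are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def relative_risk_ratio(risk_to_relative: float, population_risk: float) -> float:
    """Lambda: risk to a relative of an affected person divided by population risk."""
    return risk_to_relative / population_risk


def falconer_heritability(r_mz: float, r_dz: float) -> float:
    """Falconer's approximation: h^2 = 2 * (r_MZ - r_DZ).

    r_mz and r_dz are trait correlations in monozygotic and dizygotic twin
    pairs. This is an assumption of the sketch, not a method from the text.
    """
    return 2.0 * (r_mz - r_dz)


# Illustrative inputs only; they are not taken from the cited studies.
sibling_lambda = relative_risk_ratio(risk_to_relative=0.06, population_risk=0.004)
h2_estimate = falconer_heritability(r_mz=0.50, r_dz=0.30)

print(f"sibling lambda is about {sibling_lambda:.0f}")
print(f"heritability estimate is about {h2_estimate:.2f}")
```

A λ near 15 for siblings, as quoted above for type 1 diabetes, corresponds to a sibling risk that is 15 times the population risk; the absolute risks used here are placeholders.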

For most common diseases and complex traits, the underlying genetic variation remains unknown. In principle, the relevant genetic variation could be rare (alleles with frequencies well under 1% in the population), as is true for most single-gene disorders, or more common (frequencies above 1% in the population). The frequency spectrum of the alleles for complex traits is important to consider, because it will guide approaches to finding the causal genetic variants (see below). Although the allele frequencies are largely unknown (because the causal variants remain largely unknown), theoretical and empirical considerations suggest that for common diseases and complex traits, some of the causal genetic variants may be common (8–10). In particular, most of the genetic variation that one encounters in the human population is explained by common variants with allele frequencies of 5% or greater (11,12). Because the bulk of random genetic variation is presumably evolutionarily neutral (under neither strong positive nor strong negative selection), this suggests that disease variants that are not strongly evolutionarily deleterious will likewise be common in the population (10,13). However, we already know that for most single-gene disorders the responsible alleles are generally quite rare [unless there is a balancing selective pressure, such as malaria resistance for sickle cell disease (14)]. Why should the variants that cause common disease or affect quantitative traits not also be predominantly rare?

There are at least four arguments in favor of a role for common variation in complex traits. First, common diseases (or high or low quantitative trait values) are generally not as evolutionarily disadvantageous as single-gene disorders, which often cause early death or at least markedly decreased reproductive capability. Second, the variants that cause single-gene disorders are highly penetrant (meaning each variant is sufficient, or nearly so, to cause disease), whereas multiple variants are required to cause common diseases or strongly influence quantitative traits. Thus, the impact of selective pressure is diluted for the variants for complex traits. Third, most single-gene diseases are rare, whereas most polygenic diseases are common; population genetic arguments predict that for common diseases, some of the causal genetic variation should have a high frequency in the population, due to the demographic history of the human population (10). Similar arguments can be made for quantitative traits that vary across the entire population. Finally, empirical evidence suggests that common variants do contribute to the risk of common diseases (15).

APPROACHES TO FINDING THE GENES FOR COMPLEX TRAITS

Two major approaches have been used to map genetic variants that influence disease risk: linkage analysis and association studies. In linkage analysis, a genome-wide set of a few hundred or a few thousand markers spaced millions of bases apart is typed in families with multiple affected relatives (or multiple relatives in whom a trait has been measured). Markers that segregate with disease (or the trait) in relatives more often than expected are used to localize the disease genes. This approach has the advantage of being an unbiased, comprehensive search across the genome for susceptibility alleles, and has been successfully applied to find the genes for many single-gene disorders. However, linkage analysis has been less successful for polygenic diseases and quantitative traits [(16); see (17) for discussion], perhaps in part because of a limited power to detect the effect of common alleles with modest effects on disease (18,19).

Association studies look for a particular marker to be correlated with disease (or trait values) across a population rather than within families. These studies have much greater power to detect the effects of common variants (4). For example, the insulin VNTR class III allele, which has a frequency of approximately 70%, has been definitively shown to modestly affect the risk of type 1 diabetes, with a p value of 10⁻²², using association studies (20). By contrast, the region containing the insulin gene is just barely above the threshold for statistical significance even when all of the world's linkage data for type 1 diabetes are combined (21). Similarly, the common PPARG Pro12Ala polymorphism reproducibly affects the risk of type 2 diabetes, but studies of millions of sib pairs would be required to get a significant signal using linkage (5,19).

However, association studies require many more markers than linkage analysis. In linkage analysis, the markers must merely be in linkage with the disease allele (that is, the marker and the disease allele must generally be inherited together within the one or two generations spanned by a family). Thus, the markers can be several million bases away from the relevant gene. By contrast, association studies require that the markers be in linkage disequilibrium with the disease allele (i.e. the marker and the disease allele must generally be inherited together throughout the population, despite the many thousands of generations that have usually elapsed between a population's common ancestor and the present day). Because segments of linkage disequilibrium are measured in tens of thousands of bases (rather than the tens of millions of bases for linkage), hundreds of thousands of markers will be required to scan the genome for association (11,22). In addition, association studies, although potentially more powerful than linkage, still require sample sizes of thousands of individuals. Thus, association studies are currently limited to candidate genes or regions, because of the expense of a genome-wide approach.
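The difference in marker density follows from simple arithmetic, sketched below. The genome length of about 3 billion bases and the specific segment lengths (10 million bases for linkage, 10 thousand bases for linkage disequilibrium) are round-number assumptions chosen to match the orders of magnitude in the text, and the one-marker-per-segment rule is a deliberate simplification.

```python
GENOME_LENGTH_BP = 3_000_000_000  # approximate human genome size (assumption)


def markers_needed(segment_length_bp: int, genome_length_bp: int = GENOME_LENGTH_BP) -> int:
    """Minimum markers if each co-inherited segment needs one marker (rounded up)."""
    return -(-genome_length_bp // segment_length_bp)  # ceiling division


# Linkage: segments shared within families span roughly tens of millions of bases.
linkage_markers = markers_needed(segment_length_bp=10_000_000)

# Association: segments of linkage disequilibrium span roughly tens of thousands of bases.
association_markers = markers_needed(segment_length_bp=10_000)

print(f"linkage scan:     about {linkage_markers:,} markers")
print(f"association scan: about {association_markers:,} markers")
```

Under these assumptions the thousand-fold difference in segment length translates directly into a thousand-fold difference in the number of markers.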

INTERPRETING ASSOCIATION STUDIES

Association studies of complex traits also present an additional challenge, in that they are often misinterpreted and therefore appear to be poorly reproducible. If there is a “significant” correlation between genotypes at a variant and a disease or quantitative trait, that is, the p value is 0.05 or lower, this degree of correlation is usually interpreted as evidence of association between the variant and the phenotype. However, most such associations are not consistently reproduced—in a review of the literature, only 6 of 166 associations were replicated by at least 75% of subsequent studies (23). The possible causes of this inconsistency are false-positive reports, false-negative studies that incorrectly fail to replicate a valid association, or true heterogeneity between studies. We performed a meta-analysis of association studies (15), and found that of 25 reported associations, over half had no evidence of replication, indicating that the false-positive rate in the literature is high (using the p < 0.05 standard). However, a sizable fraction [8/25 associations in our study, and a comparable fraction in a similar study (24)] showed evidence of replication. In these cases, the failure to consistently replicate the association was likely explained by the modest effects of the causal variants on disease risk and the consequent false-negative studies. In most cases, the causal variant was associated with a 10–50% increased risk of disease, meaning that sample sizes in the thousands are required to achieve even a nominally significant p value < 0.05. Because most association studies had used samples of hundreds of individuals, the lack of consistency even for bona fide causal variants is not surprising.
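To make the sample-size argument concrete, here is a minimal sketch of a standard two-proportion power calculation applied to allele frequencies in cases and controls. It is a generic textbook approximation, not the calculation used in the cited studies. It assumes equal numbers of cases and controls, two independent alleles per individual, and a two-sided test; the function names, the control allele frequency of 0.15, and the 80% power target are all hypothetical.

```python
import math
from statistics import NormalDist


def case_allele_frequency(control_freq: float, odds_ratio: float) -> float:
    """Allele frequency in cases implied by the control frequency and allelic OR."""
    odds = odds_ratio * control_freq / (1.0 - control_freq)
    return odds / (1.0 + odds)


def individuals_per_group(control_freq: float, odds_ratio: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate cases (and, equally, controls) needed to detect the OR."""
    p0 = control_freq
    p1 = case_allele_frequency(p0, odds_ratio)
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided significance
    z_power = NormalDist().inv_cdf(power)
    variance = p0 * (1.0 - p0) + p1 * (1.0 - p1)
    alleles = (z_alpha + z_power) ** 2 * variance / (p1 - p0) ** 2
    return math.ceil(alleles / 2.0)  # two alleles per individual


for odds_ratio in (1.10, 1.25, 1.50):
    n = individuals_per_group(control_freq=0.15, odds_ratio=odds_ratio)
    print(f"OR {odds_ratio:.2f}: about {n} cases and {n} controls")
```

With these illustrative inputs, an odds ratio of 1.25 calls for on the order of a thousand cases and as many controls, and the requirement grows steeply as the effect shrinks, which is consistent with studies of a few hundred individuals often failing to replicate real but modest effects.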

An illustration of the importance of large sample sizes can be seen in the association between the PPARG Pro12Ala polymorphism and type 2 diabetes. This missense variant (25) was first reported to have a 3-fold effect on diabetes risk, with the more common proline-encoding allele conferring higher risk (26). Four of five subsequent studies reported that there was no association (27–31), because the evidence for association did not reach the nominal significance level of p < 0.05. However, our larger study both confirmed the association and demonstrated that the effect on diabetes risk was more modest than originally described [about a 25% increased risk associated with the proline variant (19)]. The association has since been confirmed in several other large studies (32–34), reaching an overall p value of less than 10⁻⁹ after more than 20,000 patients were examined (Fig. 1). The modest effect of this variant on diabetes risk likely explains most of the negative studies, which predominantly trended in the same direction and were mostly consistent with the overall estimate of the effect of this variant. Also illustrated by these data is a common phenomenon in association studies, the “winner's curse,” in which the first report overestimates the genetic effect size (15,24): the first reported association between the Pro12Ala variant and type 2 diabetes (26) provided the largest estimate of the effect of this variant on diabetes risk (Fig. 1).

Figure 1

The association between PPARG Pro12Ala and type 2 diabetes becomes more apparent as the sample size increases. Each line represents a different study; the point represents the point estimate of the odds ratio (OR) for the alanine allele (OR <1 suggests a protective effect of the alanine allele on diabetes risk), and the line represents the 95% confidence interval (CI) around the point estimate. The studies are arranged by increasing number of alanine alleles in the study, which is indicative of power. The combined data, representing more than 5,000 alanine alleles in over 20,000 individuals, are shown on the last line and correspond to an OR of 0.82 (95% CI, 0.77–0.87), with a p value of 1.8 × 10⁻¹⁰. That is, carriage of an alanine allele is associated with an approximately 20% decreased risk of type 2 diabetes. The initial study that reported an association between Pro12Ala and type 2 diabetes (26) is indicated with the arrow, and reports the largest observed effect of this variant on diabetes risk, consistent with the winner's curse (15).
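The way a combined estimate tightens as studies accumulate can be sketched with a fixed-effect, inverse-variance meta-analysis of log odds ratios. This is one common pooling method and is an assumption here; the text does not state which method produced the combined estimate in Fig. 1. The allele counts below are invented for illustration and are not data from any of the cited studies.

```python
import math

Z_95 = 1.959964  # normal quantile for a two-sided 95% confidence interval


def log_odds_ratio(ala_cases: int, pro_cases: int, ala_controls: int, pro_controls: int):
    """Log OR for the alanine allele and its standard error (Woolf's method)."""
    log_or = math.log((ala_cases * pro_controls) / (pro_cases * ala_controls))
    se = math.sqrt(1 / ala_cases + 1 / pro_cases + 1 / ala_controls + 1 / pro_controls)
    return log_or, se


def pool_fixed_effect(estimates):
    """Inverse-variance weighted mean of log ORs and the SE of that mean."""
    weights = [1.0 / se ** 2 for _, se in estimates]
    pooled = sum(w * log_or for w, (log_or, _) in zip(weights, estimates)) / sum(weights)
    return pooled, math.sqrt(1.0 / sum(weights))


def summarize(log_or: float, se: float):
    """Return (OR, lower 95% bound, upper 95% bound)."""
    return (math.exp(log_or), math.exp(log_or - Z_95 * se), math.exp(log_or + Z_95 * se))


# Hypothetical allele counts: (Ala cases, Pro cases, Ala controls, Pro controls).
hypothetical_studies = [
    (10, 190, 25, 175),      # small first report with an extreme estimate
    (80, 720, 90, 710),      # medium study
    (200, 1800, 240, 1760),  # large study
]

estimates = [log_odds_ratio(*counts) for counts in hypothetical_studies]
pooled_log_or, pooled_se = pool_fixed_effect(estimates)

for log_or, se in estimates:
    print("study:  OR %.2f (95%% CI %.2f-%.2f)" % summarize(log_or, se))
print("pooled: OR %.2f (95%% CI %.2f-%.2f)" % summarize(pooled_log_or, pooled_se))
```

In this toy example the smallest study gives the most extreme odds ratio with the widest interval, while the pooled interval is narrower than that of any single study, mirroring the pattern described for the winner's curse.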

CURRENT AND FUTURE DIRECTIONS

In the future, it will become possible to screen the vast majority of common genetic variation for a role in complex traits. These studies will be greatly facilitated by the recognition that many common variants are strongly correlated (in linkage disequilibrium), and hence redundant (11), meaning that a few hundred thousand single nucleotide polymorphisms (SNPs) will suffice to survey the approximately ten million variants with frequency 5% or greater (17,22). However, such studies are still too laborious and expensive for routine application to complex traits. Until they are more practical, we must still select genes and variants to study, and the choice of trait will also be important to maximize the chance of success.

We have chosen to study both common diseases (because of their direct medical relevance) and quantitative traits (because they are easier to measure accurately in large populations, often have a stronger genetic component, and are often themselves risk factors for disease). For example, we study stature (adult height) as a model quantitative trait because of its ease of measurement and high heritability. We have successfully performed linkage analysis to identify several regions that likely harbor genes that affect adult height (35). Several of these have been confirmed (36–38), and we will follow up these findings with association studies of genes in these regions of linkage. We are also focusing on studies of body mass index, a measure of obesity that predicts future diabetes, heart disease, and mortality.

To increase the likelihood that a candidate gene will harbor variation relevant to the traits we are studying, we integrate a wide variety of information, including previous genetic studies, expression analysis, linkage data, animal models, and knowledge of biologic pathways. We also aim within each gene both to survey broadly the majority of common variation and to capture any putative functional variants such as missense polymorphisms, and potentially also variation in evolutionarily conserved noncoding regions (39). Finally, genetic variants that have been under evolutionary selection are by definition functional and thus may be more likely to contribute to disease susceptibility (e.g. the sickle cell variant has been selected for its protective effect against malaria). To find such variants, we need to be able to recognize the genetic signatures of selection.

To help characterize the genetic signatures of selection, we studied variation around the gene encoding lactase, LCT. Most people and all other mammals lose the ability to digest lactose, which comes exclusively from dairy (40,41). However, certain human populations retain this ability into adulthood, and this trait is inherited in association with a genetic variant upstream of the lactase gene (42). By studying the length of haplotypes at this locus and the frequencies of genetic variants near LCT in different populations, we provided the first population genetics-based evidence that there has been strong selective pressure in favor of lactase persistence in Northern European populations. Indeed, the signature of selection we described is one of the strongest in the entire human genome. Genome-wide distributions for haplotype length and allele frequencies in different populations can now be calculated from genotype data in public databases such as the International HapMap Project (www.hapmap.org) or dbSNP (www.ncbi.nlm.nih.gov/SNP). Using these distributions, we plan to search the remainder of the genome for additional signatures of selection.

SUMMARY

Understanding the genetic basis of complex traits and common disease is an important goal, since the pathways that affect the risk of disease in patients are also potentially good drug targets. For example, two of the alleles known to affect the risk of type 2 diabetes are in genes (PPARG and KCNJ11) that also encode drug targets (5). Knowledge of the relevant variants may also aid in prediction and prognosis and in designing optimal therapies and intervention. Identifying the causal genetic variants will likely require a number of different approaches, but association studies are likely to be a significant tool in the near future. However, care must be taken in performing and interpreting these studies, so as to avoid generating false leads and also to correctly identify the bona fide genetic risk factors for disease.