The complete haplotype map of the human genome provides an unprecedented view of human genetic variation.
The completion of the Human Genome Project revealed, among other things, that sequence variation in the human genome is abundant1,2—so abundant, in fact, that identifying disease-causing variation from among the vast amount of inconsequential variation poses a formidable challenge. Understanding how genetic variation is organized in the genome would greatly improve the efficiency with which we can identify alleles underlying common diseases. In a recent paper in Nature, the International HapMap Consortium reports on its genome-wide survey of common genetic variation in the human genome and reveals the underlying architecture of genetic variation in human populations3.
It is estimated that there are 10 million polymorphic sites in the human genome, the result of mutations in the DNA of our ancestors that have been transmitted through the human population over time. It has also long been recognized that genetic variation at nearby sites is correlated, or in linkage disequilibrium, although the degree of correlation differs from one pair of sites to the next. The patterns of linkage disequilibrium in the genome reflect, among other things, the process of chromosomal recombination, whereby segments of DNA are swapped between maternally and paternally derived chromosomes. The segments of chromosomes that remain intact, without disruption from recombination, are inherited in blocks (Fig. 1). The delineation of these blocks, and the genetic diversity represented within them, is revealed in the human genome HapMap.
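To make the notion of correlation between sites concrete, the following minimal sketch computes two standard linkage disequilibrium measures, D and r², for a pair of biallelic sites. It is an illustration only, not the Consortium's analysis pipeline: the function name `ld_stats`, the 0/1 allele coding and the toy haplotypes are assumptions introduced here, and the input is assumed to be already phased.

```python
def ld_stats(haplotypes, i, j):
    """Return (D, r_squared) for biallelic sites i and j.

    haplotypes: list of tuples of 0/1 alleles, one tuple per phased
    chromosome (hypothetical toy encoding, not a HapMap file format).
    """
    n = len(haplotypes)
    p_a = sum(h[i] for h in haplotypes) / n          # frequency of allele 1 at site i
    p_b = sum(h[j] for h in haplotypes) / n          # frequency of allele 1 at site j
    p_ab = sum(1 for h in haplotypes if h[i] == 1 and h[j] == 1) / n
    d = p_ab - p_a * p_b                             # departure from independence
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0         # undefined for monomorphic sites; report 0
    return d, r2


# Toy data: sites 0 and 1 always travel together; site 2 is independent of both.
toy_haplotypes = [
    (0, 0, 0),
    (0, 0, 1),
    (1, 1, 0),
    (1, 1, 1),
]

d01, r2_01 = ld_stats(toy_haplotypes, 0, 1)
d02, r2_02 = ld_stats(toy_haplotypes, 0, 2)
```

In this toy example, sites 0 and 1 have never been separated by recombination, so knowing the allele at one predicts the allele at the other; site 2 carries no information about either.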
The International HapMap Consortium identified a set of over one million common single nucleotide polymorphisms (SNPs; minor allele frequency >5%) evenly spaced throughout the genome at an average density of one per 5 kb, which they genotyped in 269 individuals from four diverse populations. Their analysis of haplotype structure and diversity confirmed that recombination rates vary extensively throughout the genome, with some regions (centromeres, for example) exhibiting little or no recombination and others (hotspots) showing recombination rates dramatically higher than background rates. For example, within one 5-Mb region, 80% of all recombinations occurred in 15% of the sequence.
The result is that most of the genome comprises long blocks of DNA separated by sites of recombination. These blocks harbor many sequence variants, but because these variants are in strong linkage disequilibrium, the diversity within a block is actually quite limited. On average, a given block has only four to six haplotypes, or unique combinations of alleles at adjacent sites. Furthermore, the investigators report that common and rare haplotypes are often shared across ethnic populations, although the frequency of a particular haplotype may vary between populations. As it turns out, the average common SNP is in strong linkage disequilibrium with three to ten other SNPs, so genotyping those redundant SNPs adds no new information. Therefore, a set of fewer than 500,000 SNPs is sufficient to capture information on all common variation in the human genome.
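The redundancy argument can be sketched as a greedy selection over pairwise r² values: repeatedly pick the SNP that captures the most SNPs not yet covered, until every SNP is either chosen or strongly correlated with a chosen one. This is a simplified illustration and not necessarily the procedure used by the HapMap investigators; the function names, the toy haplotypes and the r² threshold of 0.8 (a commonly used convention) are assumptions introduced here.

```python
def r_squared(haplotypes, i, j):
    """r^2 between biallelic sites i and j from phased 0/1 haplotypes."""
    n = len(haplotypes)
    p_a = sum(h[i] for h in haplotypes) / n
    p_b = sum(h[j] for h in haplotypes) / n
    p_ab = sum(1 for h in haplotypes if h[i] == 1 and h[j] == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    return (p_ab - p_a * p_b) ** 2 / denom if denom > 0 else 0.0


def greedy_tag_snps(haplotypes, threshold=0.8):
    """Pick a small set of sites such that every site has r^2 >= threshold
    with at least one picked site (hypothetical helper, greedy heuristic)."""
    n_sites = len(haplotypes[0])
    # captures[i]: the sites that genotyping site i would stand in for.
    captures = {
        i: {j for j in range(n_sites)
            if i == j or r_squared(haplotypes, i, j) >= threshold}
        for i in range(n_sites)
    }
    untagged = set(range(n_sites))
    tags = []
    while untagged:
        # Choose the site covering the most still-untagged sites;
        # ties go to the lowest site index.
        best = max(range(n_sites), key=lambda i: len(captures[i] & untagged))
        tags.append(best)
        untagged -= captures[best]
    return tags


# Toy data: sites 0-2 form one perfectly correlated group, sites 3-4 another.
toy_haplotypes = [
    (0, 0, 0, 0, 0),
    (0, 0, 0, 1, 1),
    (1, 1, 1, 0, 0),
    (1, 1, 1, 1, 1),
]

tag_snps = greedy_tag_snps(toy_haplotypes)
```

Here two genotyped sites stand in for all five, which is the same kind of saving, on a toy scale, that lets a few hundred thousand SNPs represent the common variation in the genome.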
How will this information be used? One of the goals of the HapMap project was to improve the ability to carry out genome-wide association studies for the purpose of identifying gene variants underlying quantitative traits, common diseases and response to therapeutics (pharmacogenomics). By surveying the entire genome, the genome-wide approach makes no a priori assumptions about which genes may be important, in contrast to a candidate-gene approach in which specific genes are targeted for analysis. The approach has many strengths, but has met with skepticism for a number of reasons, including a lack of information on the extent of linkage disequilibrium throughout the genome4.
With the knowledge gained from the HapMap project, we are one step closer to making genome-wide association studies feasible. Selecting a set of nonredundant tag SNPs yields a tenfold reduction in SNP genotyping, a vast improvement in efficiency for genome-wide association studies. These genotyped SNPs will be used in population studies to examine association with disease. A positive finding may indicate that the specific block, represented by the haplotype-tagging SNP, may harbor disease-causing variants. As reported by the investigators, this approach will help to identify not only SNPs but also structural variants (deletions, duplications, rearrangements) that may underlie disease.
Was it worth the investment? The HapMap project has validated, and estimated allele frequencies for, over a million common SNPs throughout the genome in the public SNP database, dbSNP, with 11,500 of them residing within genes. Coupled with information on linkage disequilibrium, this provides a rich resource for all consumers of human genetic information, not just those undertaking genome-wide studies. In addition, the HapMap project provides empirical data to validate previous hypotheses concerning natural selection. Furthermore, the data have allowed us to resolve uncertainty about the utility of nongenic sequences that are conserved between species. Theoretical predictions that these regions lacked diversity have been refuted by HapMap data, which suggest instead that variation in these regions is skewed toward rare alleles. These novel findings highlight the importance of continued research into the function of these variants and their possible role in disease.
The question remains whether this extraordinary resource will greatly improve our ability to uncover genetic associations with disease. As a first step in cataloging human genome variation, the HapMap is an impressive achievement. In the absence of a robust strategy for identifying candidate genes a priori, whole-genome haplotype mapping offers an enticing approach for discovery of disease-causing alleles. However, much of this is predicated on the theory that common variants cause common diseases. Several examples can be found in support of this theory, but few investigators have explored alternative theories, such as one that ascribes a role to rare variants in common diseases. One exception is a recent study by Cohen et al.5, which found that the sum of rare variants in three candidate genes contributed significantly to low HDL-cholesterol levels in the general population, supporting a hypothesis that common diseases can be the result of the aggregate effects of rare variants.
In the human genome, rare variants (those having minor allele frequencies <5%) are plentiful: 45% of SNPs in specific regions evaluated by HapMap investigators were rare. Although haplotypes capture common variation, their utility as markers of rare variants is unknown. Regardless of how well haplotypes perform in this situation, if rare variants are important in common diseases, they will need to be studied directly. Since PCR-based sequencing of diploid samples (used to populate dbSNP) may be biased against very rare variants, concerted efforts to find these SNPs will need to be undertaken. To this end, technologies that make it possible to carry out deep resequencing of gene regions at substantially reduced cost and in less time, such as the one highlighted in a recent issue of Nature6, will become increasingly important as a complement to genome-wide HapMap approaches.