Introduction

The genome-wide association (GWA) approach for genetic studies of complex human diseases was first proposed by Risch and Merikangas in 1996,1 but only became feasible 10 years later. The Human Genome Project (HGP) was launched in 1990 and it took 13 years to finish the sequencing of the human genome. In 2001, both the International Human Genome Sequencing Consortium and Celera Genomics reported draft sequences of the human genome.2, 3 The HGP was deemed complete in April 2003, exactly 50 years after the description of the DNA double helix structure by James Watson and Francis Crick.4 The HGP revealed that the human genome is composed of ∼3 billion base pairs and an estimated 20 000–25 000 protein-coding genes. The completion of HGP, which represents an important milestone in human genomics,5 was followed by identification and deposition of millions of single nucleotide polymorphisms (SNPs) into public databases by The SNP Consortium (TSC) and International Human Genome Sequencing Consortium.6 This provided the foundation upon which GWA studies would subsequently build.

SNPs and copy number variations

SNPs are the most common genetic variations in the human genome; currently >10 million SNPs have been deposited into public databases, most of which are anticipated to have no functional effect. These genetic polymorphisms have proven to be very useful as genetic markers, and can be used to detect the disease variants via linkage disequilibrium (LD). Owing to the large number of SNPs in the human genome, they provide the highest resolution (in comparison to other genetic markers such as microsatellites and minisatellites) and enable researchers to comprehensively interrogate the entire human genome. Nonetheless, SNPs alone can explain neither the total human genetic diversity nor the genetic susceptibility to complex diseases and adverse drug reactions. Recently, the discovery of thousands of copy number variations (CNVs), which are ubiquitous in the human genome, has provided new insight into the complexity of genetic variation in the human genome. CNVs are expected to play an important role in the genetic basis of complex diseases, and are therefore expected to share the limelight with SNPs in future GWA studies. CNVs are structural variations or genomic alterations that change the number of copies of DNA segments larger than 1 kb (including deletions, insertions and duplications). CNVs were first reported to be ubiquitous in the human genome in 2004.7, 8 The global map of CNVs was finished in 2006 and led to the identification of about 1500 CNV regions.9

CNVs and other structural variations, such as inversions, that have thus far been identified have been deposited in the Database of Genomic Variants (http://projects.tcag.ca/variation/). The main objective of this international database is to provide a comprehensive catalog of structural variations in the human genome. The detailed description of CNVs is beyond the scope of this paper; however, a detailed review paper about structural variations is available.10 CNVs have been reported to affect the gene expression11 and the importance of CNVs has become recently apparent in susceptibility to complex diseases like autoimmune diseases, autism and bipolar disorder.12, 13, 14, 15

International HapMap Project and the concept of LD

With the identification of millions of SNPs in the human genome, it remains a daunting task to genotype every single SNP, even with the latest genotyping technologies. To overcome this obstacle, the International HapMap Project was initiated in 200316 with the aim of characterizing LD patterns and identifying haplotype-tagging SNPs in a total of 270 DNA samples that were collected from four major populations of European, African and Asian ancestry. Phase I and Phase II of the International HapMap Project were completed in 2005 and 2007, respectively.17, 18 The usefulness of the International HapMap Project is evident from the finding that the tagging SNPs identified in this global project are ‘transferable’ to many populations around the world19, 20 and to isolated populations.21, 22

At the same time, Perlegen® Sciences genotyped ∼1.58 million SNPs in 71 individuals of European, African and Asian ancestry, and reported that these SNPs were able to capture most of the common genetic variations based on LD.23 The major lesson that geneticists learnt from these two studies is that it is not necessary to genotype every single SNP in the human genome because this would be redundant. SNPs that are close to each other within a genomic region tend to be inherited together more frequently than expected by chance in a block pattern (known as a haplotype) due to the presence of LD. Because of this relationship among SNPs, genotyping merely a set of informative SNPs to serve as proxy markers (usually called tagging SNPs, with r²>0.8) is sufficient to capture most of the genetic information of the SNPs that are not genotyped, with only a slight loss of statistical power. r² is a measurement of ‘correlation’ or LD between two SNPs whose value ranges from 0 to 1 (an r² of one indicates complete LD). r² depends on both allele frequencies and recombination between the two SNPs. The sample size that is required in a genetic association study is inversely proportional to the r² value.24, 25
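The relationship between r² and sample size can be illustrated with a minimal sketch. It assumes the standard definition of r² for two biallelic SNPs, computed from haplotype and allele frequencies; the function names and the example frequencies are hypothetical and are not taken from the studies cited above.

```python
def ld_r_squared(p_ab: float, p_a: float, p_b: float) -> float:
    """r^2 between two biallelic SNPs.

    p_ab: frequency of the haplotype carrying allele A at SNP 1 and allele B at SNP 2
    p_a, p_b: frequencies of allele A (SNP 1) and allele B (SNP 2)
    """
    d = p_ab - p_a * p_b  # disequilibrium coefficient D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


def inflated_sample_size(n_direct: float, r2: float) -> float:
    """Approximate sample size needed when a tag SNP is typed instead of the
    disease variant itself: the required size scales as 1 / r^2."""
    return n_direct / r2


# Hypothetical frequencies, chosen only for illustration.
r2 = ld_r_squared(p_ab=0.28, p_a=0.30, p_b=0.30)
print(f"r^2 = {r2:.3f}")
print(f"samples needed via the tag SNP: {inflated_sample_size(1000, r2):.0f}")
```

Under these assumed frequencies the tag SNP clears the usual r²>0.8 threshold, so the cost of not typing the disease variant directly is a moderate increase in the number of samples rather than a loss of the signal.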

Key advancements in genotyping technology and genetic information

With the rapid development of genotyping technologies and decreasing genotyping costs, genotyping half a million SNPs on thousands of DNA samples is now within the capacity of many research institutes. In addition to the fixed-content genome-wide genotyping arrays, several custom-made genotyping products were also introduced by Illumina® and Affymetrix® to accelerate the fine mapping of the genomic regions identified by GWA studies and linkage analysis.26, 27, 28 The genome-wide genotyping products supplied by Illumina and Affymetrix, such as the Illumina HumanHap550 and Affymetrix GeneChip 500K, offer good coverage of the International HapMap Phase I and Phase II data in both Caucasians and Asians. However, the genomic coverage in Africans was lower due to greater genetic diversity and weaker LD.29, 30

With the wealth of genetic information gathered from the HGP and International HapMap Project, the collection of large numbers of cases and controls, the rapid advancement in genotyping technologies and the advent of powerful analysis tools such as PLINK,31 the GWA approach is rapidly becoming feasible. The GWA approach represents the most powerful and efficient study design for the genetic dissection of complex diseases in comparison to traditional linkage studies24, 25 and will remain so until we reach the $1000 whole-genome sequencing era.32

GWA studies of complex human diseases

The strengths of GWA approach are that it is hypothesis free and that it is able to comprehensively interrogate the entire human genome. This approach enables investigators to identify novel loci or genes for various diseases and quantitative traits. The achievements of GWA studies have been witnessed in genetic dissection of several complex diseases, namely, age-related macular degeneration (AMD), obesity, inflammatory bowel diseases (IBD), type-2 diabetes (T2D), breast cancer and prostate cancer. Owing to article length constraints, only the GWA studies that were published on these six complex diseases have been selected to be reviewed in this paper. Table 1 summarizes the genes or loci that were identified by GWA studies or consistently replicated for these diseases. In Figure 1, which illustrates the number of GWA publications from 2005 to June/July 2007, we see a sharp increase in the number of publications in 2007 in comparison to the previous 2 years. The aims of this review paper are to provide a timely summary of the GWA studies that were published until June/July 2007 and to evaluate to what extent the results have been replicated and validated. Replication of GWA results is essential to distinguish between ‘statistical’ artifacts and true associations.57 This review paper was organized into several sections according to the disease phenotypes and chronologically by the year of publication. The disease sections are followed by a discussion on the determinant factors of a successful GWA study, the future challenges and limitations of GWA approach.

Table 1 The genes or loci that were identified by GWA studies or consistently replicated for the complex diseases reviewed in this paper
Figure 1 The number of GWA publications from 2005 to June/July 2007.

AMD

The success of the GWA approach in the genetic dissection of complex diseases became apparent in April 2005, when the first GWA study to use a commercial genotyping platform to examine the genetic basis of AMD was published.33 In the genome-wide scan, Klein et al33 identified a common intronic variant in the complement factor H (CFH) gene that was strongly associated with AMD in 96 cases and 50 controls. The P-value of this association surpassed genome-wide significance after Bonferroni correction even with a relatively small sample size. Perhaps both the commonness of the allele and the large genetic effect that was reported (the odds ratio (OR) for the homozygous risk allele was 7.4) contributed to the highly significant finding. In addition, accurate phenotyping may have played a key role in the study, since only AMD patients with large drusen were recruited, which reduced phenotypic heterogeneity. The genomic region that was initially identified was followed up by re-sequencing and fine mapping, and finally a nonsynonymous SNP (Y402H) that was strongly associated with AMD was reported. Concurrently, two independent groups58, 59 also reported similar results via fine mapping of the genomic region in 1q31-32 that was identified in previous studies. That three separate studies firmly pinned down the same functional variant speaks to the robustness of the association, which subsequent studies60 have replicated. This is by far the most robust association that has been derived from a GWA study.

AMD can be classified into two clinical subtypes, dry (non-neovascular) or wet (neovascular). The former subtype, which accounts for ∼90% of AMD cases, was associated with the functional variant identified in CFH. However, a novel genetic variant, an SNP (rs11200638) located upstream of the HtrA serine peptidase 1 (HTRA1) putative transcription start site, was also identified for wet subtype of AMD in the study by DeWan et al.34 This association was subsequently replicated in a Caucasian population61 and in a Japanese population.62

Body mass index and obesity

Several GWA studies were published in 2006 after the genetic community witnessed the successful results for AMD. The first, published in April 2006, studied the genetic basis of obesity. Using the Affymetrix GeneChip 100K to genotype individuals from the Framingham Heart Study cohort, Herbert et al35 identified a novel common genetic variant (rs7566605) near the insulin-induced gene 2 (INSIG2) that was associated with an increased body mass index (BMI) or obesity. They subsequently tested the SNP in five additional cohorts and the association was replicated in all except one. These results provided strong statistical evidence to support the association between the INSIG2 gene and obesity; nevertheless, the genetic association failed to be replicated by three independent studies.63, 64, 65, 66 To date, the results from subsequent genetic association studies have been conflicting; the association seems to be reproduced in several but not all of the studies.67 Since then, there has been little success in the identification of genetic determinants of obesity, except for one novel gene that will be discussed in the T2D section of this review.

IBD

The first GWA study on IBD was conducted by the North American IBD Genetics Consortium.40 IBD is a chronic inflammatory disease of the gastrointestinal tract and it can be divided into two clinical subtypes, namely, Crohn's disease (CD) and ulcerative colitis. The investigators recruited ileal CD cases to minimize the phenotypic heterogeneity. This careful ascertainment of cases is recommended because it increases the statistical power to detect the disease variants, especially if the diverse clinical manifestations are due to genetic heterogeneity. During an interim analysis of their data, Duerr et al40 identified a novel gene for IBD: a nonsynonymous SNP (Arg381Gln) in the interleukin-23 receptor (IL23R) gene. This gene encodes a subunit of the receptor for IL23 (a pro-inflammatory cytokine) and thus is an interesting and biologically plausible gene for inflammatory diseases. The genetic association was subsequently replicated in a Jewish case–control study and in a family-based association study.

The association that they found in the interim analysis was then unequivocally replicated by four independent groups,68, 69, 70, 71 providing compelling evidence to support IL23R as a genuine susceptibility gene for IBD. The results of their complete genome scan were published recently41 and the investigators were able to uncover several novel loci for IBD; the most notable novel gene was autophagy-related 16-like 1 (ATG16L1). The nonsynonymous SNP (rs2241880) located in exon 8 of this autophagy gene was previously reported to be associated with CD in a gene-centric GWA study.72 Likewise, this association was unambiguously replicated by two independent studies from the UK.73, 74

T2D

Although the role of genetic susceptibility in T2D is well established, the results from genetic association studies have been quite disappointing. Before 2006, there was limited success in genetic studies of T2D. With the exception of the PPARG and KCNJ11 genes,44, 45 the results for the genes identified have been conflicting and inconsistent. In 2006, the TCF7L2 gene was first reported to be associated with T2D by deCODE Genetics, who fine mapped a suggestive linkage region identified previously in an Icelandic population.46 The association has since been consistently replicated in more than 20 studies across different populations with diverse ancestral backgrounds, thereby providing convincing evidence that TCF7L2 is associated with T2D.75

The first GWA study on T2D revealed four novel loci, the most notable being a nonsynonymous SNP in the solute carrier family 30 member 8 (SLC30A8) gene.47 This gene encodes a zinc transporter protein expressed only in β cells, which is also implicated in the final stages of insulin biosynthesis, making it a strong biological candidate for T2D. In addition, the investigators also confirmed the previously identified diabetes gene TCF7L2. However, only two of these four novel loci were successfully replicated by the three diabetes GWA studies that were published concurrently in Science (discussed below).

The aim of the GWA study by Frayling et al36 was to identify susceptibility genes for T2D; however, they uncovered a new gene for BMI, the fat mass and obesity-associated (FTO) gene. Initially, several SNPs in the FTO gene were found to be strongly associated with T2D. However, the association was abolished after adjustment for BMI in cases and controls. It was therefore concluded that the FTO gene was more likely to be associated with BMI or obesity. To test this hypothesis, the investigators attempted to replicate the association in nine independent studies, and the results showed that the common variants in the FTO gene were reproducibly associated with BMI and the risk of being overweight or obese from childhood to adulthood. This enormous replication effort has provided strong evidence beyond statistical doubt for the genetic association.

Huge success in the genetic dissection of T2D was achieved this year by three GWA studies conducted by the Wellcome Trust Case Control Consortium (WTCCC), the Diabetes Genetics Initiative and the Finland-United States Investigation of Non-Insulin Dependent Diabetes Mellitus Genetics.48, 49, 50 These diabetes research groups were able to discover three novel loci for T2D. Their success highlights the importance of scientific collaboration and sharing of genome-wide genotyping data among different research groups. For these three GWA studies, the combined sample size exceeded 32 000 samples. This large sample size allowed the investigators to detect variants with modest genetic effects (OR of 1.1–1.2). All three putative candidate genes that were identified are biologically plausible genes for T2D, that is, cyclin-dependent kinase inhibitor 2A/2B (CDKN2A/CDKN2B), CDK5 regulatory subunit associated protein 1-like 1 (CDKAL1) and insulin-like growth factor 2 binding protein 2 (IGF2BP2). In addition to these discoveries, they also successfully replicated the genetic association of several genes known to be associated with diabetes, namely, PPARG, KCNJ11, TCF7L2, SLC30A8 and HHEX. In all, eight loci/genes have been detected and consistently replicated for T2D in Caucasians. Interestingly, all the genetic variants identified have been located in noncoding regions, particularly an SNP (rs10811661) that is located 125 kb away from the nearest annotated genes, CDKN2A/CDKN2B. In addition to these three studies, an independent GWA study was conducted concurrently in an Icelandic population.51 The investigators also managed to identify the CDKAL1 gene for T2D. All loci newly discovered by these GWA studies were replicated in a series of studies with large sample sizes, and are therefore likely to be bona fide loci or genes for T2D. The identification of these novel loci is important to further enhance our understanding of the genetic basis and pathogenesis of T2D.

Breast cancer

The highly penetrant genes BRCA1 and BRCA2 account for only ∼20% of the total genetic risk of breast cancer. Thus far, the results from genetic association studies for this cancer have been unsatisfying. However, through the efforts of the Breast Cancer Association Consortium, which conducted candidate gene case–control association studies with enhanced statistical power by combining several breast cancer cohorts, two novel genes were identified, namely, CASP8 and TGFB1.76 Recently, success in genetic studies of breast cancer has also been seen in three studies52, 53, 54 and their findings are starting to shed new light on the genetic basis of this cancer. With their three-stage study design and a sample size of more than 50 000 cases and controls, Easton et al52 identified six highly significant SNPs. The most notable loci identified were fibroblast growth factor receptor 2 (FGFR2) and the LD block that contains the TNRC9 gene. FGFR2 encodes a tyrosine kinase receptor, which was overexpressed in breast cancer. Therefore, it is a strong biological candidate gene. This novel gene was simultaneously uncovered by Hunter et al,53 who identified four SNPs within intron 2 of FGFR2 that were highly associated with breast cancer. Collectively, these findings indicate a novel susceptibility gene for breast cancer, but further studies are required to fine map the region and to identify the disease variants.

The third GWA study was done by Stacey et al54 in Icelandic breast cancer cases and controls. The ten SNPs with the most significant P-values were tested in five additional cohorts. However, only two SNPs were consistently associated with breast cancer: one on chromosome 2q35 (rs13387042) and the other on 16q12 (rs3803662). Interestingly, the 2q35 region contains no known genes, but the SNP on 16q12 is located near the 5′ region of the TNRC9 gene that was simultaneously identified by Easton et al.52

Prostate cancer

As with other cancers, identifying genetic variants with modest effect for prostate cancer has proven difficult. However, some success was achieved recently in genetic studies of prostate cancer. The chromosome 8q24 region was first identified via genome-wide linkage analysis in Icelandic families with prostate cancer. Allele-8 of the microsatellite DG8S737 was consistently associated with the disease in a series of studies.77 One year later, the investigators reported a second genetic variant in the 8q24 region for prostate cancer through a GWA study.55 Yeager et al56 also applied the same approach in the genetic exploration of prostate cancer. Haiman et al78 followed up on their initial admixture results by extensively fine mapping the region. The results from these three independent studies suggest that 8q24 is implicated in prostate cancer and that this genomic region may harbor susceptibility genes for prostate cancer. Further studies are needed to discern the disease variants and genes within this region.

WTCCC

The WTCCC study is the largest GWA study to date and explored the genetics of seven common diseases.37 In total, 17 000 individuals (2000 cases for each of the seven common diseases and 3000 shared controls) were genotyped with the Affymetrix GeneChip 500K. The study was a huge success because many novel loci were uncovered for these common diseases, and many of them have been successfully replicated in other independent sample sets, namely, for T2D (as discussed above), CD,43 and type-1 diabetes.79 As for CD, a second autophagy gene, immunity-related GTPase family, M, was revealed in the study by Parkes et al.43 In addition to the discovery of many novel loci, almost all of the genes that were identified by previous studies for these common diseases were successfully replicated by the WTCCC; for example, the HLA-DRB1, INS, CTLA4, PTPN22, IFIH1 and IL2RA80 genes were replicated for type-1 diabetes. As a whole, these results suggest that there may be two important pathways in the pathogenesis of CD, namely, the IL23 and autophagy pathways.

From the list of novel loci or genes revealed by the WTCCC, one of the most intriguing results, perhaps, is the association of coronary artery disease with the region on chromosome 9p21, as this region contains the coding sequences for CDKN2A/CDKN2B, which are genes associated with T2D as well. At the same time, the same region was found to be associated with coronary heart disease and myocardial infarction by two other independent studies, respectively.81, 82 These results might help to explain why some individuals are more susceptible to these two closely related common diseases.

Parkinson disease and other neurological diseases

The papers that were discussed above are examples of successful GWA studies, which yielded spectacular results (except the association of INSIG2 gene with obesity). On the other hand, there have been examples of GWA studies of complex diseases where validation of initial results has not been achieved yet. For instance, a two-stage GWA study by Maraganore et al83 identified 13 SNPs strongly associated with Parkinson disease. Nevertheless, lack of replication was observed subsequently84, 85, 86 despite a large sample size of ∼10 000 from a large scale international study.87

Hitherto three genome-wide scans using either Illumina HumanHap300 or HumanHap550 genotyping platform were completed for these complex diseases, namely, Parkinson disease, ischemic stroke and amyotrophic lateral sclerosis88, 89, 90 and the data was released into public domain. None of the SNPs achieved or surpassed genome-wide significance in these three studies. There are several possible explanations: (1) the Bonferroni correction is too stringent, which may have overcorrected the significance threshold, (2) there is a lack of statistical power to detect common variants with modest effect in these genome-wide scans because of relatively small sample sizes and finally, (3) perhaps there is no disease variant with large genetic effect (like the case of CFH gene for AMD) for these neurological diseases. Nevertheless, the availability of these resources will allow researchers to access the genome-wide scan data and will thus accelerate the pace of discovery to identify novel genetic variants for these neurological diseases.

Determinant factors for a successful GWA study

From the experience of recent GWA studies, we learn that one of the major determinants of success is a large sample size, which provides adequate statistical power to detect genetic variants with modest effect (that is, OR<1.5). The statistical power of a genetic association study is basically a function of sample size, magnitude of genetic effect and allele frequency. As the latter two factors are unknown until the genetic variants are uncovered, sample size is the major controllable factor in the determination of statistical power. In addition, power also depends on the tag SNPs selected, that is, the genome coverage.91 Both sample size and genome coverage are modifiable; thus, increasing both will increase the statistical power of the study. A large sample size can be more easily achieved through consortia or collaboration, as demonstrated in the breast cancer (BCAC) and T2D studies, and the Genetic Association Information Network is a good example of these efforts. The Genetic Association Information Network is a public–private partnership that was established to interrogate the genetic basis of common diseases through a series of collaborative GWA studies.92
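The dependence of power on sample size, effect size and allele frequency can be made concrete with a rough sketch. It assumes a simple two-sided allelic test comparing risk-allele frequencies between cases and controls under a normal approximation; this is an illustrative simplification, not the method used in any of the reviewed studies, and the function names and input values are hypothetical.

```python
from statistics import NormalDist


def case_allele_freq(control_freq: float, odds_ratio: float) -> float:
    """Risk-allele frequency in cases implied by a per-allele odds ratio."""
    odds = odds_ratio * control_freq / (1 - control_freq)
    return odds / (1 + odds)


def allelic_test_power(n_cases: int, n_controls: int, control_freq: float,
                       odds_ratio: float, alpha: float) -> float:
    """Approximate power of a two-sided allelic test (normal approximation)."""
    std_normal = NormalDist()
    p0 = control_freq
    p1 = case_allele_freq(control_freq, odds_ratio)
    m1, m0 = 2 * n_cases, 2 * n_controls  # each person contributes two alleles
    pooled = (m1 * p1 + m0 * p0) / (m1 + m0)
    se_null = (pooled * (1 - pooled) * (1 / m1 + 1 / m0)) ** 0.5
    se_alt = (p1 * (1 - p1) / m1 + p0 * (1 - p0) / m0) ** 0.5
    z_crit = -std_normal.inv_cdf(alpha / 2)
    # Ignores the negligible chance of rejecting in the wrong direction.
    return std_normal.cdf((abs(p1 - p0) - z_crit * se_null) / se_alt)


# Hypothetical scenario: risk-allele frequency 0.30, OR 1.2, alpha 1e-7.
for n in (2000, 10000, 20000):
    power = allelic_test_power(n, n, control_freq=0.30, odds_ratio=1.2, alpha=1e-7)
    print(f"{n} cases and {n} controls: power = {power:.3f}")
```

With the effect size and allele frequency held fixed, power in this sketch rises steeply as the number of cases and controls grows, which is the reason pooled consortium data sets can detect variants with an OR near 1.2 that individual scans miss.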

Almost all the SNPs that appeared highly significant in WTCCC GWA study either had strong prior evidence of association with the diseases studied or were successfully confirmed in the subsequent replication studies. A large sample size with an enriched statistical power is both crucial and essential in GWA study to ensure that genuine associations rank at the top of the SNPs according to the P-values.93 Researchers should be cautious when applying stringent significance thresholds (like Bonferroni corrections) or when weighing the SNPs for replication in a two-stage study design to control the false-positive associations, as they run the risk of overcorrecting and subsequently having type II error (because many true positive associations with modest P-values would have been excluded in the genome-wide scan). Stringent significance threshold is only reasonable if the sample size in the genome-wide scan is large and the statistical power is adequate enough to allow most of the true signals to rank at the top of the list.

All reviewed studies used commercially available genotyping platforms from Illumina, Affymetrix or Perlegen. Good genome coverage is of utmost importance in GWA studies because this approach relies on LD to detect the disease variants. In regions that contain few SNPs or are poorly covered by markers, genuine disease variants might be missed because they are not in strong LD (r²>0.8) with any of the SNPs genotyped on the array. Furthermore, extensive replication is essential to declare a bona fide association. Reproducing strong associations that were found in the genome-wide scan does not seem to be a major issue in the recent GWA studies discussed above (with few exceptions), as all were successfully validated in a series of replication studies that provided strong evidence to support the genetic association.

Proper study design also plays a key role in determining the success of GWA studies. Researchers need to pay more attention to methodological and analytical issues, such as applying stringent quality control to genotyping data to minimize genotyping errors, which could both produce spurious associations and mask true associations. It is also recommended to use another genotyping platform such as Sequenom iPlex to validate the SNPs from the genome-wide scan, since this genotyping technology uses a totally different principle of allelic discrimination, that is, MALDI-TOF MS (matrix-assisted laser desorption/ionization time-of-flight mass spectrometry),94 versus the hybridization and fluorescence intensity measurement methods employed in both the Illumina and Affymetrix platforms. Accurate classification of the disease phenotype is equally important for a successful GWA study; the importance of this criterion was demonstrated in the AMD and Crohn's disease studies by Klein et al and Duerr et al,33, 40 respectively.

It is important to address the issue of population stratification, even when the study was conducted in a relatively homogenous population, or if the cases and controls were well matched and recruited from the same geographical location, because these techniques cannot totally eliminate the effect of population stratification. The effect of this confounding factor will be intensified in the GWA studies where tens of thousands of samples are needed.95 Freeware such as EIGENSTRAT96 (available online) should be applied in GWA studies to identify outliers with different ancestry backgrounds and to exclude them from further analysis. This issue has been receiving attention from researchers in their GWA studies as discussed above to exclude entirely the possibility that the positive associations observed are attributed to population stratification.

Challenges waiting ahead

Once researchers have established an association beyond statistical doubt, three additional challenges still lie ahead. First, although many novel loci were identified for complex diseases, the task of identifying the actual functional disease variants remains. It is often difficult to discern the disease variants, especially when the surrounding markers are in perfect or nearly perfect LD (r²>0.9), because they will give almost the same strength of association. Therefore, re-sequencing is usually needed after identifying the genomic region that potentially harbors the disease variants. Re-sequencing strategies will enable investigators to uncover novel and uncommon variants. Often, functional studies are also required, but these studies are only feasible for genes or regions that are well characterized. In most cases, the GWA approach is unlikely to directly reveal the functional variants for the disease. This is demonstrated by the AMD study in which DeWan and coworkers first identified an intragenic SNP in their genome-wide scan before they unraveled the functional variant that affects the transcription of the HTRA1 gene.34

Second, it remains difficult to establish the functional role of the disease variants, for example, how they affect the structure and function of the genes (and their end products, proteins), and also transcriptional regulation. This is especially challenging for SNPs located in genes that are not well characterized or have unknown functions, in noncoding regions and in gene deserts, since our knowledge about the functional elements in the human genome is still very limited. For instance, a strong association for a cluster of SNPs on chromosome 5p13.1 was consistently found for CD.37, 42 Interestingly, this region is located within a 1.2 Mb gene desert and the nearest annotated gene is prostaglandin E receptor 4 (PTGER4). So, one wonders how these ‘long-distance’ variants affect the function of the disease genes. Perhaps we can get some answers from the pilot phase of the ENCODE Project.97 The ENCODE Project found that regulatory regions or elements of a gene can be located far from it and yet still be able to affect the expression and function of the gene. This project was initiated after the completion of the HGP with the aim of identifying and characterizing all the functional elements within the entire human genome.98 Although the pilot phase of the ENCODE Project is finished, 99% of the human genome still needs to be explored. Investigating the functional roles of SNPs located within noncoding regions and gene deserts poses tremendous challenges while promising great rewards: the identification of novel, previously uncharacterized functional elements. Lastly, it is not a trivial task to elucidate the molecular pathway based on the results derived from GWA studies of the disease, especially for genes or proteins with unknown function.

Limitations of GWA studies

At present, the GWA approach is the ‘best’ medicine for dissecting the genetic basis of complex diseases, but it is not a ‘panacea’. This study design has several limitations and problems. In the GWA approach, several hundred thousand SNP markers throughout the entire genome are analyzed at once, creating a multiple-hypothesis problem that can lead to substantial type I error. To minimize false-positive results, a statistical adjustment such as the Bonferroni correction is applied, and a very stringent P-value is needed, usually a significance level of 10⁻⁷. This translates into the requirement for a large sample size, tens of thousands of samples, for both the genome-wide scan and the replication studies. Such a sample size is unlikely to be attained in a single study; hence, collaborations and the establishment of consortia are of utmost importance.
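The arithmetic behind this threshold can be sketched in a few lines of Python. This is an illustrative calculation only: the figure of 500 000 markers and the family-wise error rate of 0.05 are assumptions chosen to match ‘several hundred thousand’ SNPs, and the function names are hypothetical.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level that keeps the family-wise error rate <= alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be a positive integer")
    return alpha / n_tests


def expected_false_positives(per_test_alpha: float, n_tests: int) -> float:
    """Expected number of false positives if no marker is truly associated."""
    return per_test_alpha * n_tests


# Assumed values for illustration; not taken from any particular study.
N_MARKERS = 500_000
FAMILY_WISE_ALPHA = 0.05

threshold = bonferroni_threshold(FAMILY_WISE_ALPHA, N_MARKERS)
uncorrected = expected_false_positives(FAMILY_WISE_ALPHA, N_MARKERS)
corrected = expected_false_positives(threshold, N_MARKERS)

print(f"Bonferroni per-marker threshold: {threshold:.1e}")
print(f"Expected false positives at P < 0.05: {uncorrected:.0f}")
print(f"Expected false positives at the corrected threshold: {corrected:.2f}")
```

Under these assumptions, testing every marker at P < 0.05 would be expected to yield about 25 000 false positives, whereas the corrected per-marker threshold is 10⁻⁷. Because the Bonferroni correction treats the tests as independent, it is conservative when markers are correlated through LD.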

The GWA approach is based on the principle of LD; therefore, the genetic markers that are identified are unlikely to be the disease variants themselves. Extensive re-sequencing and fine mapping are required to discern the disease variants, and fine mapping is particularly challenging when the SNPs within the genomic region are in strong LD. Strong LD is a double-edged sword: although it reduces the number of markers that must be genotyped in the genome-wide scan, it also limits the ability to ‘resolve’ the association and makes it difficult to identify the ‘culprits’. Biological studies are therefore required to determine the functional roles of the disease variants.

The GWA study design is hypothesis-generating rather than hypothesis-testing; therefore, replication is paramount for confirming the results, and it has been widely accepted as the gold standard for discerning genuine genetic associations. Lack of replication and conflicting results remain a problem in some GWA studies, for example those of Parkinson's disease83 and of the INSIG2 gene in obesity.35 These and other problems in GWA studies have been well addressed by Shriner et al99 and Williams et al100 in their Letters to Science.

Conclusions

The genetic spectrum of complex human diseases has yet to be elucidated, but recent achievements in genetic studies of various complex diseases have provided some new insights. Most of the genetic variants that have been consistently identified for the diseases studied are common (minor allele frequency >5%) and confer only a modest genetic effect (OR<1.5). Does this mean that the genetic spectrum of complex diseases comprises only common variants with modest effect? The answer is probably no; rare variants, such as those in IL23R,40 are likely to contribute to disease risk and to influence quantitative traits, for example, high-density lipoprotein cholesterol level.101, 102 The relative contributions of common and rare variants to the total genetic component of complex diseases and quantitative traits are still largely unknown.

The lower-than-expected success in identifying rare genetic variants might be due to the poor coverage of rare variants by current genotyping platforms. The SNP markers included in both Illumina and Affymetrix genotyping arrays are biased toward common alleles. As a result, these markers are in weak LD (low r²) with rare variants, mainly because of the discrepancy between their allele frequencies. For r² to be high, the allele frequencies of the two SNPs must be comparable and no recombination must have occurred between them; only then can the proxy marker predict the nongenotyped SNP via LD. Since statistical power drops drastically for rare SNPs, sample sizes larger than those reported in recent GWA studies might be needed to detect rare variants.
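The dependence of r² on allele frequencies can be made concrete with a short Python sketch. It uses the standard definition r² = D²/(pA(1−pA)pB(1−pB)) for two biallelic SNPs and assumes the most favorable case, in which no recombination has occurred and the rarer minor allele is found only on haplotypes carrying the other minor allele. The function name and the example frequencies are hypothetical and chosen only for illustration.

```python
def max_r2(maf_a: float, maf_b: float) -> float:
    """Upper bound on r^2 between two biallelic SNPs given their minor allele frequencies.

    With lo <= hi the two minor allele frequencies, the largest attainable
    D is lo * (1 - hi), which gives
        r^2_max = lo * (1 - hi) / (hi * (1 - lo)).
    """
    for maf in (maf_a, maf_b):
        if not 0.0 < maf <= 0.5:
            raise ValueError("minor allele frequencies must lie in (0, 0.5]")
    lo, hi = sorted((maf_a, maf_b))
    return (lo * (1.0 - hi)) / (hi * (1.0 - lo))


# Assumed example: an array marker with minor allele frequency 0.30
# used as a proxy for ungenotyped variants of decreasing frequency.
MARKER_MAF = 0.30
for variant_maf in (0.30, 0.10, 0.05, 0.01):
    bound = max_r2(MARKER_MAF, variant_maf)
    print(f"marker MAF {MARKER_MAF:.2f}, variant MAF {variant_maf:.2f}: "
          f"max r^2 = {bound:.3f}")
```

Under these assumptions, r² can reach 1 only when the two frequencies are equal, and a marker with a minor allele frequency of 0.30 cannot exceed an r² of about 0.02 with a variant of frequency 0.01, even in the absence of recombination.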

According to the common disease common variant hypothesis,24, 25 a common disease such as T2D is likely to be due to common genetic variants with modest effect. Since the genetic variants are common, they are likely to be shared across populations of diverse ancestry. Most of the GWA and replication studies were conducted in Caucasian populations; less replication effort has been devoted to other populations, such as Asians and Africans. It would therefore be interesting to determine how many of the loci or genes identified by these GWA studies are also associated with the disease phenotypes in other populations. Well-designed replication studies are crucial to either validate or refute the initial positive association. Guidelines for conducting replication studies were suggested by the NCI-NHGRI Working Group on Replication in Association Studies.57

The most successfully studied diseases thus far are T2D and CD; GWA studies have consistently identified about 10 loci or genes for each of them. Does this signal the end of genetic association studies of these diseases? This is probably only the beginning; it has been predicted that many genetic variants or genes still remain to be uncovered.

Note

GWA studies are developing at an unprecedented pace: during the submission and revision of this review, a substantial number of further GWA studies were published. Discussing them in detail is beyond the scope and length of this review. Nevertheless, we think that it is important to briefly highlight the GWA studies published during July–December 2007 (although the list is incomplete); the complex diseases interrogated by these GWA studies include restless legs syndrome (periodic limb movements),103, 104 coronary artery disease,105 multiple sclerosis,106 gallstone disorder,107 exfoliation glaucoma,108 colorectal cancer,109, 110 HIV,111 type 1 diabetes,112 childhood asthma,113 atrial fibrillation,114 sporadic amyotrophic lateral sclerosis115, 116 and rheumatoid arthritis.117, 118