Introduction

Genome-wide association studies (GWAS) are emerging as a major tool to identify disease susceptibility loci and have been successful in detecting the association of a number of SNPs with complex diseases.1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 However, testing only for association of a single SNP is insufficient to dissect the complex genetic structure of common diseases. Extracting biological insight from GWAS and understanding the principles underlying the complex phenomena that take place on various biological pathways remain a major challenge. The common approach in GWAS is to select dozens of the most significant SNPs for further investigation. This approach, which takes only SNPs as the basic units of association analysis, has several serious limitations. First, a single SNP showing a significant association with a complex disease typically has only a mild effect.13 A common disease often arises from the joint action of multiple loci within a gene or of multiple genes within a pathway. If we consider only the most significant SNPs, genetic variants that jointly have significant risk effects but individually make only a small contribution will be missed. Second, locus heterogeneity, in which alleles at different loci cause disease in different populations, makes it harder to replicate the association of a single marker.14 A gene, and particularly a pathway, consists of a group of interacting components that act in concert to perform specific biological tasks. Replication of an association finding at the gene or pathway level is much easier than replication at the SNP level. Third, attempting to understand and interpret a number of significant SNPs without any unifying biological theme can be challenging and demanding. SNPs and genes carry out their functions through intricate pathways of reactions and interactions.
The function of many SNPs may not be well characterized, but the functions of genes and of particular pathways have been much better investigated. Therefore, gene and pathway-based association analysis allows us to gain insight into the functional basis of the association and helps to unravel the mechanisms of complex diseases.

To meet the conceptual and technical challenges raised by GWAS and to take full advantage of the opportunities they provide, gene and pathway-based association analysis can be used as a complementary approach to the genome-wide search for association of single SNPs with a disease. Gene and pathway-based association analysis considers a gene or a pathway as the basic unit of analysis. Gene and pathway-based GWAS aim to study simultaneously the association of a group of genetic variants in the same biological pathway,14, 15, 16 which can help us to unravel holistically the complex genetic structure of common diseases and to gain insight into the biological processes and disease mechanisms.17

Gene and pathway-based GWAS can be performed by extending gene-set enrichment analysis for gene expression data18 to genome-wide association studies. However, a simple application of gene-set analysis methods for gene expression data to GWAS may not work well. The key difference between gene expression data and SNP data is that in expression analysis each gene is represented by a single value, its expression level, whereas in GWAS each gene is represented by a varying number of SNPs. The challenge is how to represent a gene.19, 20 One promising approach is to combine the P-values of correlated SNPs into an overall significance level that represents a gene, and then to combine the P-values of the genes into an overall significance level to investigate the association of a pathway with the disease.21

Materials and methods

Gene-based association analysis

Statistical analyses for testing the association of a gene with a disease were conducted on the basis of the combination of P-values of the SNPs in the gene.14 We assume that the P-values $P_i$ are independent and uniformly distributed under their null hypotheses, although the independence assumption may be violated because of linkage disequilibrium among SNPs in the gene. Several methods were used to combine independent P-values. A general framework for combining independent P-values is as follows. Let $P_i$ be the P-value for the corresponding statistic $T_i$, with distribution G, to test the i-th marker $M_i$. Let H be a continuous monotonic function. A transformation of the P-value is defined as $Z_i = H^{-1}(1 - P_i)$.
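As an illustration of this framework, consider one particular choice of H. The choice below is an assumption made for the example, not something stated in the paragraph above: taking H to be the distribution function of a χ² variable with 2 degrees of freedom yields the transformation that underlies Fisher's combination test.

```latex
H(z) = 1 - e^{-z/2}, \quad z \ge 0
\quad\Longrightarrow\quad
Z_i = H^{-1}(1 - P_i) = -2 \ln P_i .
```

Under the null hypothesis $P_i$ is uniform on (0, 1), so each $Z_i$ is χ² distributed with 2 degrees of freedom, and a sum of K independent $Z_i$ is χ² distributed with 2K degrees of freedom.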

Fisher's combination test

The full combination methods combine the P-values of all SNPs within the gene. The statistic for combining K independent P-values, that is, for combining information from K SNPs, is usually given by

$Z_F = -2 \sum_{i=1}^{K} \ln P_i$,

which follows a $\chi^2_{2K}$ distribution under the null hypothesis.21
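A minimal sketch of Fisher's combination test follows. It assumes the usual form of the statistic (minus twice the sum of the log P-values, referred to a χ² distribution with 2K degrees of freedom) and uses the closed-form upper tail of a χ² distribution with even degrees of freedom, so only the standard library is needed. The function name is hypothetical.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine K independent P-values with Fisher's method (illustrative sketch)."""
    k = len(pvalues)
    if k == 0:
        raise ValueError("need at least one P-value")
    if any(p <= 0.0 or p > 1.0 for p in pvalues):
        raise ValueError("P-values must lie in (0, 1]")
    # Half of the Fisher statistic: x / 2 with x = -2 * sum(log P_i).
    half_stat = -sum(math.log(p) for p in pvalues)
    # Upper tail of chi-square with 2K df: exp(-x/2) * sum_{j=0}^{K-1} (x/2)^j / j!
    term = 1.0
    total = 1.0
    for j in range(1, k):
        term *= half_stat / j
        total += term
    return min(1.0, math.exp(-half_stat) * total)


# Two SNPs that are each only nominally significant combine to a smaller P-value.
print(fisher_combined_pvalue([0.05, 0.05]))
```

With a single P-value the function returns that P-value unchanged, which is a quick sanity check on the tail formula. Because the method assumes independence, SNPs in strong linkage disequilibrium would make the combined value too small.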

Sidak's combination test (the best SNP)

If we consider only the best SNP in the gene, then the statistic is defined as $Z_B = P_{(1)}$, the smallest of the K P-values, which is distributed as $P(Z_B \le w) = 1 - (1 - w)^K$. This statistic is often referred to as Sidak's correction.
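The best-SNP rule can be sketched in a few lines. The sketch evaluates 1 − (1 − w)^K at w equal to the smallest observed P-value; the function name is hypothetical, and `log1p`/`expm1` are used only to keep precision when the smallest P-value is tiny.

```python
import math


def sidak_pvalue(pvalues):
    """Gene-level P-value from the best (smallest) SNP P-value (illustrative sketch)."""
    k = len(pvalues)
    if k == 0:
        raise ValueError("need at least one P-value")
    p_min = min(pvalues)
    if p_min >= 1.0:
        return 1.0
    # 1 - (1 - p_min)^K, computed stably for very small p_min.
    return -math.expm1(k * math.log1p(-p_min))


# Best SNP P = 0.01 among 3 SNPs gives 1 - 0.99**3.
print(sidak_pvalue([0.01, 0.5, 0.9]))
```

Note that the result depends only on the smallest P-value and on K, so a gene with many mildly associated SNPs gains nothing under this rule.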

Simes' combination test

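Let the P-values be ordered as $P_{(1)} \le P_{(2)} \le \dots \le P_{(K)}$. The P-value is calculated as $P_{\mathrm{Simes}} = \min_{1 \le i \le K} \{ K P_{(i)} / i \}$.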
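A short sketch of Simes' combination follows. It assumes the standard form of the test, in which the combined P-value is the minimum over ranks i of K·P(i)/i; the function name is hypothetical.

```python
def simes_pvalue(pvalues):
    """Simes' combined P-value for K P-values (illustrative sketch)."""
    k = len(pvalues)
    if k == 0:
        raise ValueError("need at least one P-value")
    ordered = sorted(pvalues)  # P_(1) <= P_(2) <= ... <= P_(K)
    # min over ranks i of K * P_(i) / i, capped at 1.
    return min(1.0, min(k * p / i for i, p in enumerate(ordered, start=1)))


# One strong SNP and one null SNP: the candidates are 2 * 0.04 / 1 and 2 * 0.9 / 2.
print(simes_pvalue([0.04, 0.9]))
```

Unlike the best-SNP rule, the second and later ranks can determine the result, so several moderately small P-values can yield a smaller combined value than the smallest one alone would.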

The FDR method

Let π be the proportion of tests with a true null hypothesis and F(α) be the expected proportion of tests yielding a P-value less than or equal to α, V(α) be the expected proportion of tests giving a false positive result with significance level α.

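Suppose that there are d distinct P-values among $p = \{p_1, \dots, p_k\}$, ordered as $\tilde{p}_1 < \tilde{p}_2 < \dots < \tilde{p}_d$. Let $m_j$ be the number of P-values among p that are equal to $\tilde{p}_j$.

Then F(α) is estimated by an empirical estimate $\hat{F}(\alpha)$ computed from the counts $m_j$, in which I is an indicator function. For a two-sided test define $\hat{\pi} = \min(1, 2\bar{p})$, and for a one-sided test (χ²-test, trend test) define $\hat{\pi} = \min(1, 2\bar{a})$, where $\bar{p}$ is the average of the $p_i$ and $\bar{a}$ is the average of $a_i = 2\min(p_i, 1 - p_i)$. Then V(α) is estimated by $\hat{v}(\alpha) = \hat{\pi}\alpha$. Define $t_{(j)} = \hat{v}(\tilde{p}_j) / \hat{F}(\tilde{p}_j)$ and $q_{(i)} = \min_{j \ge i} \{ t_{(j)} \}$;

$q_{(1)} \le q_{(2)} \le \dots \le q_{(m)}$ are the ordered false discovery rates. We also take $q_{(1)} = \min_j \{ t_{(j)} \}$ as the false discovery rate for the gene or pathway.19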
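The following is a rough sketch of an FDR calculation of this kind, under assumptions that go beyond what the text states. It assumes that F is estimated by the plain empirical distribution function of the P-values, that the null proportion is estimated as min(1, 2 × mean P-value) (the two-sided case), and that the raw FDR at each distinct P-value is the ratio of the estimated false-positive proportion to the estimated F. The exact estimator used in the study may differ, and the function name is hypothetical.

```python
def fdr_qvalues(pvalues):
    """Return (distinct P-values, monotone FDR estimates); illustrative sketch only."""
    k = len(pvalues)
    if k == 0:
        raise ValueError("need at least one P-value")
    # Assumed null-proportion estimate for two-sided tests: min(1, 2 * mean P-value).
    pi_hat = min(1.0, 2.0 * sum(pvalues) / k)
    distinct = sorted(set(pvalues))
    t = []
    for p in distinct:
        # Assumed estimate of F: fraction of P-values <= p (an indicator-function sum).
        f_hat = sum(1 for x in pvalues if x <= p) / k
        t.append(min(1.0, pi_hat * p / f_hat))
    # Enforce monotonicity: q_(i) = min over j >= i of t_(j).
    q = list(t)
    for i in range(len(q) - 2, -1, -1):
        q[i] = min(q[i], q[i + 1])
    return distinct, q


distinct, q = fdr_qvalues([0.01, 0.02, 0.5, 0.9])
print(q[0])  # smallest FDR, taken as the gene- or pathway-level value
```

The first element of `q` corresponds to the quantity that the text takes as the false discovery rate for the gene or pathway.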

Pathway-based association analysis

Consider m genes in a pathway. Assume that the P-value for each gene is calculated using one of the methods of combining independent P-values mentioned in the previous section. The methods for testing the association of a pathway with the disease are given below.

Hypergeometric test (Fisher's exact test)

Fisher's exact test is performed to search for an overrepresentation of significantly associated genes among all the genes in the pathway. We assume that the total number of genes of interest is N. Let S be the number of genes that are significantly associated with the disease (P-value ≤0.05, calculated by Fisher's combination test) and m be the number of genes in the pathway. Let k be the number of significantly associated genes in the pathway. The P-value of observing k or more significant genes in the pathway is calculated by $P = 1 - \sum_{i=0}^{k-1} \binom{S}{i} \binom{N-S}{m-i} / \binom{N}{m}$.
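The overrepresentation test can be sketched as an upper-tail hypergeometric probability. The sketch assumes the standard form of the test, that is, the probability of drawing k or more significant genes when m genes are sampled from N genes of which S are significant; the function name is hypothetical, and the upper tail is summed directly to avoid cancellation in one minus the lower tail.

```python
from math import comb


def pathway_overrepresentation_pvalue(n_genes, n_significant, pathway_size, k):
    """P(X >= k) for a hypergeometric count of significant genes (illustrative sketch).

    n_genes       -- N, total number of genes of interest
    n_significant -- S, genes significantly associated with the disease
    pathway_size  -- m, genes in the pathway
    k             -- significant genes observed in the pathway
    """
    if not (0 <= k <= pathway_size <= n_genes and 0 <= n_significant <= n_genes):
        raise ValueError("inconsistent counts")
    denom = comb(n_genes, pathway_size)
    upper = min(n_significant, pathway_size)
    # Exact integer arithmetic for the numerator; comb() returns 0 for impossible draws.
    numer = sum(
        comb(n_significant, i) * comb(n_genes - n_significant, pathway_size - i)
        for i in range(k, upper + 1)
    )
    return numer / denom


# 10 genes, 3 significant, pathway of 4 genes containing all 3 significant genes.
print(pathway_overrepresentation_pvalue(10, 3, 4, 3))
```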

Sidak's method

Both P-values for testing the association of the gene and the pathway are calculated by Sidak's method, which is described in the previous section.

Simes' method

Both P-values for testing the association of the gene and the pathway are calculated by Simes' method, which is described in the previous section.

Simes/FDR method

The P-value for testing the association of the gene is calculated by Simes' method and the P-value for testing the association of the pathway is calculated by the FDR method.

Results

To investigate what the basic units for genome-wide association studies should be and to illustrate how to perform gene and pathway-based genome-wide association analysis, we examined 13 published GWAS (Supplementary Table 1). In that table, WTCCC denotes the Wellcome Trust Case Control Consortium; NARAC, the North American Rheumatoid Arthritis Consortium; EIRA, the Swedish Epidemiological Investigation of Rheumatoid Arthritis; DGI, the Diabetes Genetics Initiative; AREDS, the Age-Related Eye Disease Study; and CORIELL, the Coriell Institute for Medical Research. The studies cover 10 diseases: bipolar disorder (BD), coronary artery disease (CAD), Crohn's disease (CD), hypertension (HT), rheumatoid arthritis (RA), type I diabetes (T1D), type II diabetes (T2D), Parkinson's disease (PD), age-related eye disease (AREDS) and amyotrophic lateral sclerosis (ALS). As only P-values for testing the association of single SNPs (but not individual genotypes) were publicly accessible, we used statistical methods for combining independent P-values to perform gene and pathway-based GWAS (see Materials and methods). The methods for combining dependent P-values require individual genotype information and cannot be applied here. The numbers of typed cases and controls, the numbers of typed SNPs and genes, and the P-value thresholds for genome-wide significance using Bonferroni correction for each study are listed in Supplementary Table 1.

The procedure for gene and pathway-based GWAS consists of two steps. The first step is to combine the set of P-values for the SNPs in a gene, obtained from single-SNP GWAS, into an overall significance level for the gene. The second step is to combine the set of P-values for the genes in a pathway into an overall P-value for the pathway. To combine P-values, one typically assumes that the P-values are independent and uniformly distributed under the null hypothesis. In this report, four combination tests were used: Fisher's combination test, Sidak's combination test, Simes' combination test and a test based on the false discovery rate (see Materials and methods). As the SNPs within a gene may be in linkage disequilibrium, the P-values of SNPs from the same gene are often not independent, and hence the independence assumption is violated. We nevertheless used methods for combining independent P-values, for the following reasons. First, the methods for combining dependent P-values require individual genotype data, which in many cases are not publicly accessible. Second, the errors that arise from violating the independence assumption are not very large. (We will present the results of a comparison of methods combining independent P-values and those combining dependent P-values elsewhere.) Third, Q–Q plots for the four combination tests (Supplementary Figure 1) showed that the observed distribution of P-values of the combination tests (except for Fisher's combination test) matches the expected distribution for the majority of the data, but begins to depart from the null at 3.15 × 10^−6 (gene) and 10^−4 (pathway).
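The two-step procedure can be sketched as a small pipeline. This is only an illustration under stated assumptions: Simes' combination (in its standard form) is used at both levels, the gene and pathway names and P-values are invented for the example, and genes without typed SNPs are simply skipped.

```python
def simes_pvalue(pvalues):
    """Simes' combined P-value: min over ranks i of K * P_(i) / i."""
    k = len(pvalues)
    ordered = sorted(pvalues)
    return min(1.0, min(k * p / i for i, p in enumerate(ordered, start=1)))


def gene_and_pathway_pvalues(snp_pvalues_by_gene, genes_by_pathway, combine=simes_pvalue):
    """Step 1: SNP P-values -> gene P-values. Step 2: gene P-values -> pathway P-values."""
    gene_p = {
        gene: combine(pvals)
        for gene, pvals in snp_pvalues_by_gene.items()
        if pvals
    }
    pathway_p = {}
    for pathway, genes in genes_by_pathway.items():
        # Only genes with at least one typed SNP contribute to the pathway.
        typed = [gene_p[g] for g in genes if g in gene_p]
        if typed:
            pathway_p[pathway] = combine(typed)
    return gene_p, pathway_p


# Hypothetical input: single-SNP P-values grouped by gene, and a pathway definition.
snp_pvalues_by_gene = {"GENE_A": [0.01, 0.2], "GENE_B": [0.03, 0.04, 0.5]}
genes_by_pathway = {"PATHWAY_X": ["GENE_A", "GENE_B", "GENE_C"]}

gene_p, pathway_p = gene_and_pathway_pvalues(snp_pvalues_by_gene, genes_by_pathway)
print(gene_p, pathway_p)
```

Passing a different `combine` function (for example a Fisher or Sidak combination) changes the test used at both levels; using different tests at the two levels, as in the Simes/FDR method, would need two separate arguments.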

We obtained the combined P-values for each gene. Supplementary Tables 2a and 2b summarize the total numbers of significant genes, significant SNPs, and significant SNPs that belong to non-significant genes. The numbers of replicated SNPs and genes in the different studies, or the numbers of significant SNPs and genes shared by several diseases, are shown in Table 1. In Supplementary Tables S3–S15 we have listed all significant genes with P-values ≤3.15 × 10^−6, which were calculated by Fisher's combination test or by the test based on the false discovery rate (FDR), for the 13 studies. In these tables we also included the number of typed SNPs within each significant gene and the P-value of the most significant SNP in the gene. Supplementary Tables S16–S18 list the significant SNPs and genes for PD, RA and T2D that were shared by two independent studies. Three remarkable features emerge from these tables. First, except for RA and T1D, the number of significant SNPs in each study is very small, but the number of significant genes is quite large. A large proportion of the significant genes do not contain even a single significant SNP. For example, in the T2D study (WTCCC), the P-values of the best SNPs in the genes PPARG, JAZF1, TSPAN8 and THADA were 0.001205, 0.001681, 0.0000156 and 0.01080, respectively, but the overall P-values of these genes were 2.87 × 10^−5, 8.58 × 10^−7, 3.17 × 10^−13 and 1.80 × 10^−5, respectively. Although the initial single-SNP analysis did not find any significant SNPs in these genes, a recent meta-analysis22 showed that the P-values of the best SNPs in these genes were 2.00 × 10^−7, 5.00 × 10^−14, 1.10 × 10^−9 and 1.10 × 10^−9, respectively. This shows that the results of the gene-based association analysis were consistent with the results of the meta-analysis.
If we conduct only single-SNP association analysis, these significant genes might be missed because of the low power of the small samples in the initial GWAS. Second, replication of association findings at the gene level in additional independent samples is much easier than at the SNP level. We examined association studies of three diseases, T2D, PD and RA, each with two independent studies. For T2D, no SNPs were replicated in the two independent studies (WTCCC and DGI) after correction for multiple tests by the Bonferroni method. However, seven genes, including TCF7L2 (transcription factor 7-like 2) and CDKAL1 (CDK5 regulatory subunit associated protein 1-like 1), were replicated (Supplementary Table S17). The gene TCF7L2, which has a marked effect on type II diabetes, has a widely replicated association in several studies.2, 23 In single-SNP association analysis, although a strong association of CDKAL1 was reported from WTCCC (P=1.02 × 10^−6) and WTCCC/UKT2D2, 3 (P=10^−8), the original scan and follow-up replication samples from DGI support only nominal association (P=0.0024). In gene-based analysis, a strong association of CDKAL1 was observed in both WTCCC (P<10^−20) and DGI (P=1.84 × 10^−6) (Supplementary Table S17). To explain why replication of significant genes in independent samples is much easier than replication of significant SNPs, we have listed all SNPs with P-values <0.05 for three genes in Table 2. Table 2 shows that although only a few single SNPs in the genes CDKAL1, TTLL5 and BTBD16 showed significant association in the WTCCC study or the DGI study, the joint effects of multiple SNPs with very mild effects led to the three genes being strongly associated with the disease in both studies. Third, gene-based association analysis can identify the common genes that are shared within a disease group more effectively than single-SNP association analysis.
Although there is considerable heterogeneity among complex diseases, many diseases share common phenotypes and so form a disease group. In the studies that we examined here, CD+RA+T1D are autoimmune diseases, and CAD+HT+T2D have metabolic and cardiovascular phenotypes in common. GWAS offer us an opportunity to reveal the genetic variants that confer a risk of more than one disease. Supplementary Table 19 summarizes the shared genes within each disease group based on the best SNP within the gene. In other words, a gene is considered shared within a disease group if at least one significant SNP in the gene is common to the diseases in the group. As shown in Supplementary Table 19, on this basis we can find shared genes only in the RA+T1D disease group. However, if we perform gene-based association analysis, as shown in Supplementary Table 20, we can find a number of shared genes within the CD+RA+T1D, CAD+HT+T2D and RA+T1D disease groups.

Numerous genome-wide gene expression analyses have shown that single-gene analysis can find little similarity between two independent studies, but pathway-based analysis may find a number of pathways in common.24 A pathway analysis is done to identify pathways that are significantly associated with the disease. In other words, we test whether genes that are significantly associated with the disease are overrepresented in the pathway. We assembled 465 pathways from KEGG25 and Biocarta (http://www.biocarta.com). Table 3 summarizes the number of significant pathways, and Table 4 summarizes the number of replicated pathways associated with the diseases RA, T2D and PD in two independent studies, or the number of pathways shared within the disease groups CAD+HT+T2D, RA+T1D and CD+RA+T1D in the WTCCC studies. These significant pathways were identified by an overrepresentation test and by the Simes/FDR method.
Supplementary Tables 21–33 summarize all significant pathways with P-values ≤0.01, which were calculated by Fisher's exact test and by the Simes/FDR method, for the 13 studies. Supplementary Tables 34–36 list all significant pathways associated with the diseases RA, T2D and PD that were replicated in two independent studies, and Supplementary Tables 37–39 list the significant pathways shared by the disease groups CAD+HT+T2D, RA+T1D and CD+RA+T1D. These tables show several remarkable features that can be used to extract biological insight from GWAS. First, as shown in Table 3, a much larger proportion of pathways than of genes, let alone SNPs, was significantly associated with the disease. This implies that pathways have essential roles in causing disease. We note that many of the pathways showing significant association are among the core pathways that define complex diseases. For example, the MAPK pathway, the JNK pathway, the ubiquitin–proteasome pathway, O-Glycan biosynthesis and Axon guidance, which showed significant association with PD in two studies (CORIELL and NCBI), have been reported as a set of major pathways implicated in PD.26, 27 Pathway-based association analysis identified NF-kB, p38 MAPK, Angiotensin II-mediated activation of the JNK pathway, activation of PKC through G-protein-coupled receptor pathway, Wnt-signaling pathway, adherens junction, melanogenesis, ECM-receptor interaction and vitamin C in the brain pathway, which form the major pathways defining T2D28 (Supplementary Table 40). Second, the results of pathway-based GWAS can be verified by functional pathway enrichment analysis of gene expression data. For example, RA is an autoimmune disease whose major feature is chronic inflammation of the joints.
Our pathway-based association analysis identified cytokine–cytokine receptor interaction, IFN α signaling, Jak-STAT signaling, complement and coagulation cascades, and fatty acid biosynthesis pathways, which were confirmed by pathway enrichment analysis of gene expression profiles of the peripheral blood cells of RA patients.29 Third, replication of the association of pathways in independent samples is much easier than replication of genes or SNPs. Replications can be performed at the level of the SNP, the gene or the pathway. As shown in Table 1, no significant SNPs (using the Bonferroni method for correction of multiple tests) can be replicated in the GWAS of T2D, and only seven significant genes can be replicated in the WTCCC and DGI studies. However, 10 (Simes/FDR) or 5 (Fisher's exact test) pathways can be replicated (Table 4). Risk genes may differ between individuals but may lie in the same pathway. Identifying the pathways associated with a disease therefore makes it easier to uncover the pathogenesis of the disease. Figures 1a and b plot the GnRH-signaling pathway, which was associated with RA in the WTCCC studies with P-values of 1.48 × 10^−14 (Fisher's combination test), 0.025 (Fisher's exact test) and 0.017 (Simes/FDR), and in the NARAC and EIRA studies with P-values of 1.00 × 10^−17 (Fisher's combination test), 0.0055 (Fisher's exact test) and 1.39 × 10^−16 (Simes/FDR). Although the GnRH pathway was significantly associated with RA in both studies, the genes that showed significant association in the two studies were different. Two paths are involved in the GnRH pathway: Gs → AC → PKA → gonadotropin gene expression and secretion, and the MAPK pathway (GRB2 → Sos → Ras → Raf1 → MEK1/2 → ERK1/2 → gonadotropin gene expression and secretion).
In the WTCCC studies, genes in the first path, such as GNAS (Gs, P-value <0.0097), ADCY2 (AC, P-value <0.000191) and PRKACB (PKA, P-value <4.48 × 10^−6), showed a strong or mild association, but did not show any association in the NARAC and EIRA studies. The genes in the second path (MAPK pathway), GRB2 (P-value <1.27 × 10^−5), KRAS (Ras, P-value <7.77 × 10^−6) and MAP2K1 (ERK, P-value <0.005), were associated with RA in the NARAC and EIRA studies, but not in the WTCCC studies. It is well known that the endocrine system may have an important role in the pathogenesis of RA. Gonadotropins are hormones secreted by gonadotrope cells of the pituitary gland. The two major gonadotropins are luteinizing hormone and follicle-stimulating hormone. Gonadotropins have marked immunomodulatory properties and may have important roles in the pathogenesis of various immune-regulatory diseases. Sex hormone levels, including estrogen and/or progesterone in women and testosterone in men, are reported to be relatively low in most RA patients.30 These observations are consistent with disease mechanisms associated with gonadotropin. It is interesting to note that the P-values of the best SNPs in the genes PRKACB, GRB2 and KRAS were 0.013, 0.006 and 0.0012, respectively. This example shows that each SNP may make only a small contribution, but their joint action may affect the functioning of the pathway, which in turn will cause the disease.

Table 1 Number of replicated or shared SNPs and genes
Table 2 Overall P-values of the genes CDKAL1, TTLL5 and BTBD16 and their SNPs with P-values less than 0.05 in WTCCC and DGI studies
Table 3 The number of pathways showing a significant association
Table 4 Number of replicated or shared pathways
Figure 1

P-values of genes in the GnRH pathway for RA. (a) P-values of genes in the GnRH pathway for RA in the WTCCC studies. (b) P-values of genes in the GnRH pathway for RA in the NARAC and EIRA studies. In both panels, blocks containing significant genes are red, blocks containing mildly significant genes are light red and blocks containing no significant genes are green.

Discussion

Despite the rapid progress of GWAS, the most widely used approach remains individual SNP association analysis, which evaluates the significance of individual SNPs. However, GWAS at the SNP level alone have serious limitations: they offer only a limited understanding of complex diseases as an integrated whole. What should the future developments for GWAS be? To address this issue, we propose a systems biology approach, which considers not only the SNP but also the gene and the pathway as basic units of GWAS, to decipher the complex path from genotype to phenotype. The proposed paradigm for GWAS consists of three components: SNP-, gene- and pathway-based association analyses. We performed comprehensive gene and pathway-based GWAS for 11 diseases, assuming that the results of single-SNP association analysis are available. Our results showed that the proposed paradigm not only identified the genes that include significant SNPs found by single-SNP analysis, but also detected new genes in which each single SNP conferred only a small disease risk but whose SNPs jointly were implicated in the development of disease. We assessed the new genes identified by the new paradigm in two ways. First, these new findings were replicated in two independent samples. Second, the SNPs located in the newly identified genes were not significant in any of their original studies, but showed strong association in the recently published meta-analysis of genome-wide association data and large-scale replication. Our results also strongly showed that replication of an association finding at the gene or pathway level is much easier than replication at the individual SNP level. One of the major advantages of the new paradigm is that pathway-based analysis can add structure to genomic data and gives us a deeper understanding of cellular processes as intricate networks of functionally related genes.
We further showed that the new paradigm can also offer opportunities for finding the pathways that are common within disease groups. We used RA as an example to show that the pathways identified by the new paradigm for GWAS can be confirmed by a gene-set enrichment analysis using gene expression data. This implies that the new paradigm will open a new avenue for integrating GWAS with other functional analyses and hence will help to uncover the mechanisms of complex diseases.

As current GWAS report only the P-value for each single SNP, and the individual genotype data are not publicly available, our methods for gene and pathway-based GWAS are designed for P-value data. The major tool for gene and pathway-based analyses is to combine independent P-values of single SNPs in a gene into an overall P-value for the gene, and independent P-values of single genes in a pathway into an overall P-value for the pathway. As the SNPs in a gene are often dependent, we need methods for combining dependent P-values, which in turn require individual genotype information. The limitation of the proposed gene and pathway-based association analysis is that it is based on combining independent P-values and is not appropriate for dependent data. In particular, the significance of a gene or pathway, calculated by Fisher's method of combining independent P-values of SNPs, will be inflated if there are large correlations among the SNPs in the gene. A gene and pathway-based analysis that uses methods for combining dependent P-values will therefore be needed. Gene and pathway-based GWAS that take correlations among the SNPs and genes into account will be carried out in the near future.