A pathway analysis of genome-wide association study highlights novel type 2 diabetes risk pathways

Genome-wide association studies (GWAS) have been widely used to identify common type 2 diabetes (T2D) variants. However, the known variants just explain less than 20% of the overall estimated genetic contribution to T2D. Pathway-based methods have been applied into T2D GWAS datasets to investigate the biological mechanisms and reported some novel T2D risk pathways. However, few pathways were shared in these studies. Here, we performed a pathway analysis using the summary results from a large-scale meta-analysis of T2D GWAS to investigate more genetic signals in T2D. Here, we selected PLNK and VEGAS to perform the gene-based test and WebGestalt to perform the pathway-based test. We identified 8 shared KEGG pathways after correction for multiple tests in both methods. We confirm previous findings, and highlight some new T2D risk pathways. We believe that our results may be helpful to study the genetic mechanisms of T2D.

Using PLINK, we got 806 significant T2D risk genes with P < 0.05. Using WebGestalt, 731 of these 806 genes were mapped to 731 unique Entrez Gene IDs. We identified that these 731 genes were significantly enriched in 11 KEGG pathways with a Bonferroni correction test P < 0.05/220 = 2.27E-04 (supplementary Table 1). Metabolic pathways (hsa01100) is the most significant pathway with P = 4.92E-07. More detailed information about these 11 KEGG pathways is described in supplementary Table 1.
Using VEGAS, we got 1514 significant T2D risk genes with P < 0.05. Using WebGestalt, 1343 of these 1514 genes were mapped to 1514 unique Entrez Gene IDs. We further identified that these 1343 genes were significantly enriched in 44 KEGG pathways with a Bonferroni correction test P < 0.05/220 = 2.27E-04 (supplementary Table 2). Metabolic pathways (hsa01100) is the most significant pathway with P = 4.50E-10. More information about these 44 KEGG pathways is described in supplementary Table 2.
We further compared the T2D risk genes identified by the PLINK (n = 806) and VEGAS (n = 1514) methods. We found that 570 T2D risk genes were shared in these two methods. We also compared the pathway analysis results using T2D risk genes from PLINK and VEGAS. We identified 8 KEGG pathways to be shared in both methods with a Bonferroni correction test with P < 0.05/220 = 2.27E-04. Here, we list the results about the 8 shared pathways in Table 1. In order to get a better idea of these 8 pathways involved and how they correlated to each other, we provided pathway relationship in Fig. 1  Abbreviations for all the six statistics in enrichment analysis: C, the number of reference genes in the category; O, the number of genes in the gene set and also in the category; E, expected number in the category; R, the ratio of enrichment, Figure 1. The pathway relationship of 8 KEGG pathways by pathway analysis of significant T2D risk genes from PLINK. The weight of pair-wise pathways is based on their related genes is defined by the Jaccard Index, given by the ratio of the intersection and union of the two gene sets.
metabolism, Type II diabetes, TGF-signaling pathway. Interestingly, we successfully replicated Type II diabetes as described in supplementary Table 2 16 . Zhong et al. integrated the pathway analysis and gene expression into T2D WTCCC GWAS dataset 17 . They highlighted the involvement of 22 KEGG pathways in T2D risk 17 . Interestingly, we successfully replicated 9 of these 22 pathways including Tight junction, Neuroactive ligand-receptor interaction, Cell cycle, and Antigen processing and presentation, as described in supplementary Table 2. Meanwhile, both the Tight junction and Cell cycle are included in the 8 shared pathways with P < 0.05/220 = 2.27E-04.

Discussion
Until now, pathway analysis of T2D GWAS dataset has been performed 16,17 . Here, we performed a pathway analysis of large-scale T2D GWAS meta-analysis dataset, and identified 9 significant pathways shared in both methods. We confirm previous findings, such as Tight junction and Cell cycle 17 . Here, we also highlight some new T2D risk pathways, such as Melanogenesis, Vibrio cholerae infection, Gastric acid secretion, Phagosome, Ubiquitin mediated proteolysis, and Protein processing in endoplasmic reticulum.
It is reported that pathway analysis of GWAS datasets may be more successful for some complex traits 21 . However, some pathway analysis methods may have limitations and the analysis results may be unstable 21 . The gene set size and gene length, LD patterns and the presence of overlapping genes could inflate the gene-score p-values and further may cause biases in pathway analysis 21,22 . It is suggested that multiple methods should be used to evaluate the reliability of the results 21 . Here, we selected PLNK 19 and VEGAS 20 to perform the gene-based test. Both PLNK and VEGAS could adjust the gene length and LD patterns.
There are some differences in PLINK and VEGAS. First, there are different statistic methods in PLINK and VEGAS. PLINK applied an approximate Fisher's test to combine the P values across all the SNPs in genes 20 . In VEGAS, all the P values across all the SNPs in genes are converted to uppertail chi-squared statistics with one degree of freedom 20 . VEGAS applied the summarized chi-squared 1 degree of freedom statistics within a specific gene 20 . PLINK uses the average association test statistic across a given set of SNPs as the "set-based" test statistic 20 . VEGAS software uses the sum rather than average 20 . Second, there are different methods to map the SNPs to their corresponding genes in PLINK and VEGAS. In PLINK, a given set of SNPs were mapped to genes if these SNPs were located within the genomic sequence corresponding to the start of the first exon and the end of the last exon of any transcript corresponding to that gene 23 . In VEGAS, a given set of SNPs were mapped to ±50 kb of the 5′ and 3′ UTR of the corresponding genes 20 . Third, there are different methods to account for the LD, gene size, and the P value in PLINK and VEGAS. PLINK uses permutation test and VEGAS uses the simulation test 20 . Fourth, VEGAS method is much faster than the PLINK set-based test 20 . Here, we focus on the overlapping pathways resulting from the genes from PLINK and VEGAS methods based on these differences above.
Until now, two pathway analyses have been performed 16 17 . We think that the integration of pathways and gene expression may be more powerful compared with the single pathway analysis, which may further reduce the number of significant pathways. In summary, we analyzed the large-scale T2D GWAS meta-analysis dataset using PLNK and VEGAS. We not only confirm previous findings, but also highlight some new T2D risk pathways. We believe that our results may be helpful to study the genetic mechanisms of T2D. We will further replicate our findings using other available pathway analysis methods in future. Further replication studies are also required to evaluate our findings.

Materials and Methods
The T2D GWAS dataset. The GWAS dataset came from a meta-analysis of T2D GWAS datasets from DIAGRAM Consortium (http://diagram-consortium.org/downloads.html) 18 . Here, we give a brief description about the GWAS meta-analysis dataset 18 . More detailed information is provided in the original studies 18 . The meta-analysis consists of 12,171 T2D cases and 56,862 controls from 12 T2D GWAS from European descent populations 18 . All these samples were genotyped using kinds of genotyping platforms 18 . Each SNP was tested for association with T2D under an additive model after adjustment for study-specific covariates including indicators of population structure 18 . Full details of genotyping, quality control and imputation in each study are described in Supplementary Table 1 of the original study 18 . We got the association summary statistics from the original study 18 .
Gene-based test using PLINK. PLINK (SET SCREEN TEST) was used to perform the gene-based test of the T2D GWAS dataset in the gene level 19 . This method is based on the meta-analysis of all the SNPs in genes using the linkage disequilibrium (LD) information from the HapMap CEU population 23 . PLINK applied an approximate Fisher's test to combine P values across all the SNPs in genes to get the overall significance 23 . Meanwhile, PLINK supports larger genes (genes with more SNPs), and uses permutation testing to account for the correlation between SNPs (LD), gene size, and the P value from gene based test 23 . Gene-based test using VEGAS. VEGAS was also applied to conduct a gene-based test of the T2D GWAS dataset in the gene level 20 . The software utilizes all SNPs within a gene and adjusts the gene sizes, SNP density, and LD relation in SNPs 20 . VEGAS first assigns SNPs within ±50 kb from the 5′ and 3′ UTR to the corresponding genes according to the position information 20 . In a given gene, all the SNPs association P values are converted to uppertail chi-squared statistics with one degree of freedom 20 . The gene-based test statistic is the summarized chi-squared 1 degree of freedom statistics within a specific gene 20 . Meanwhile, simulations are used to adjust the LD relation in SNPs within a specific gene using the HapMap2 CEU genotype dataset 20 .