eSNPO: An eQTL-based SNP Ontology and SNP functional enrichment analysis platform

Genome-wide association studies (GWASs) have identified many common genetic variants associated with complex human traits such as diseases. Functional annotation and enrichment analysis of the significant SNPs are important follow-up tasks. Classic methods typically link SNPs to genes by physical position. Expression quantitative trait loci (eQTLs) are genomic loci that contribute to variation in gene expression levels and have proven effective for connecting SNPs to genes. In this work, we integrated eQTL data with the Gene Ontology (GO), constructed associations between SNPs and GO terms, and then performed functional enrichment analysis. The result is an eQTL-based SNP Ontology and SNP functional enrichment analysis platform. A case study on Parkinson Disease (PD) shows that the proposed platform and method are effective. We believe eSNPO will be a useful resource for SNP functional annotation and enrichment analysis once significant disease-related SNPs have been identified.

In this study, using eQTLs as the link between SNPs and their functions, we integrated eQTL and GO information and constructed a human SNP Ontology database and SNP functional enrichment analysis platform. It provides an efficient tool for the analysis of a complex trait after GWAS.
Materials and Methods
eQTL data. The eQTL data were collected from several open databases and the literature. Gene expression patterns are tissue-specific, and so are eQTL patterns, so the data must be classified by tissue type. We classified them into 12 tissues (Table 1) and combined the data from different studies of the same tissue type. For each dataset, we set a significance threshold of FDR < 0.05. We retained only SNPs with reference names and genes with gene symbols. For each tissue type, the reported numbers of samples, SNPs and genes are those remaining after this screening.
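The screening and merging steps described above can be sketched in Python. This is a minimal illustration, not the pipeline actually used: the record fields (`study`, `tissue`, `snp`, `gene`, `fdr`), the function name `filter_and_merge`, the example identifiers, and the reading of "reference name" as an rs-style ID are all assumptions made for the example.

```python
import re
from collections import defaultdict

FDR_THRESHOLD = 0.05
# Assumption: a "reference name" is an rs-style identifier such as rs123.
RS_ID = re.compile(r"^rs\d+$")


def filter_and_merge(records):
    """Keep significant, well-named SNP-gene pairs and pool them by tissue."""
    by_tissue = defaultdict(set)
    for rec in records:
        if rec["fdr"] >= FDR_THRESHOLD:   # not significant at FDR < 0.05
            continue
        if not RS_ID.match(rec["snp"]):   # SNP lacks a reference name
            continue
        if not rec["gene"]:               # gene lacks a symbol
            continue
        # A set removes pairs reported by more than one study of the same tissue.
        by_tissue[rec["tissue"]].add((rec["snp"], rec["gene"]))
    return dict(by_tissue)


# Hypothetical records from two studies, for illustration only.
records = [
    {"study": "A", "tissue": "brain", "snp": "rs123", "gene": "GENE1", "fdr": 0.01},
    {"study": "B", "tissue": "brain", "snp": "rs123", "gene": "GENE1", "fdr": 0.03},
    {"study": "B", "tissue": "brain", "snp": "rs456", "gene": "GENE2", "fdr": 0.20},
    {"study": "A", "tissue": "liver", "snp": "chr1:12345", "gene": "GENE3", "fdr": 0.01},
    {"study": "A", "tissue": "liver", "snp": "rs789", "gene": "", "fdr": 0.01},
    {"study": "A", "tissue": "liver", "snp": "rs789", "gene": "GENE4", "fdr": 0.04},
]
merged = filter_and_merge(records)
```

Storing each tissue as a set of (SNP, gene) pairs makes the merge across studies idempotent, so a pair reported twice is counted once.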
Brain data. As Parkinson Disease (PD) is a disorder of the central nervous system, we selected brain eQTL data from Gibbs et al. 21 and Myers et al. 22 for the case study. In the study by Gibbs et al., frozen tissue samples from four brain regions, the cerebellum (CRBLM), frontal cortex (FCTX), caudal pons (PONS) and temporal cortex (TCTX), were obtained from 150 neurologically normal Caucasian subjects, giving 600 tissue samples. SNP genotyping was performed using Infinium HumanHap 550 beadchips (Illumina) for 561,466 SNPs. Profiling of 22,184 mRNA transcripts was performed using HumanRef-8 Expression BeadChips (Illumina). For each of the four brain regions, a regression analysis was performed using Plink 23. After the eQTL analysis in each brain region, we integrated the results. In the study by Myers et al., whole-genome genotyping for 366,140 SNPs and expression analysis of 14,078 genes were carried out on 193 neurologically normal human brain samples using the Affymetrix GeneChip Human Mapping 500K Array Set and Illumina HumanRefseq-8 Expression BeadChip platforms. A one-degree-of-freedom allelic test of association was performed using Plink 23. We integrated the results from these two studies and obtained 51,131 significant associations between 22,740 SNPs and 7,161 genes at a threshold of FDR < 0.05.
Gene annotation data. The gene annotation data were downloaded from the Gene Ontology (GO) database (www.geneontology.org/page/download-annotations) 7,8.
eSNPO construction. We defined associations between SNPs and GO terms by combining the SNP-gene associations from the eQTL data with the gene-GO term associations from the GO annotation database. A SNP and a GO term are associated if they share at least one gene (Fig. 1).
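The construction rule, that a SNP and a GO term are linked when they share at least one gene, amounts to a join of the two association tables on the gene. A minimal Python sketch follows; the function name `build_snp_go` and all identifiers in the example data are hypothetical placeholders, not entries from eSNPO.

```python
from collections import defaultdict


def build_snp_go(snp_gene_pairs, gene_go_pairs):
    """Link each SNP to every GO term annotated to at least one of its eQTL genes."""
    gene_to_go = defaultdict(set)
    for gene, go_term in gene_go_pairs:
        gene_to_go[gene].add(go_term)

    snp_to_go = defaultdict(set)
    for snp, gene in snp_gene_pairs:
        # Genes without any GO annotation contribute nothing.
        snp_to_go[snp] |= gene_to_go.get(gene, set())

    # Drop SNPs whose genes carry no GO annotation at all.
    return {snp: terms for snp, terms in snp_to_go.items() if terms}


# Hypothetical inputs, for illustration only.
snp_gene = [("rs1", "GENE1"), ("rs1", "GENE2"), ("rs2", "GENE2"), ("rs3", "GENE9")]
gene_go = [("GENE1", "GO:A"), ("GENE2", "GO:B"), ("GENE2", "GO:C")]
snp_go = build_snp_go(snp_gene, gene_go)
```

Because the result maps each SNP to a set of terms, a SNP-term pair supported by several genes is stored only once, which corresponds to removing redundant associations.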

SNP functional enrichment analysis.
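We performed the Fisher exact test to estimate the significance of associations between SNPs and GO terms. The one-sided Fisher exact test is equivalent to the hypergeometric test. Suppose eSNPO contains N SNPs, of which M are disease-related, and a given GO term is associated with n SNPs, of which m are disease-related. The p value is the hypergeometric probability of observing at least m disease-related SNPs among the n SNPs of the term:

$$p = \sum_{i=m}^{\min(n, M)} \frac{\binom{M}{i}\binom{N-M}{n-i}}{\binom{N}{n}}$$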
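The upper-tail hypergeometric probability, that is, the chance of seeing at least m disease-related SNPs among the n SNPs of a GO term when M of the N SNPs in eSNPO are disease-related, can be computed directly from binomial coefficients. The following Python sketch uses only the standard library; the function name `enrichment_p_value` is an assumption for illustration, and the choice of the upper tail reflects the usual over-representation test rather than a detail confirmed by the text.

```python
from math import comb


def enrichment_p_value(N, M, n, m):
    """Upper-tail hypergeometric p value: P(X >= m).

    N: SNPs in the background, M: disease-related SNPs in the background,
    n: SNPs annotated to the GO term, m: disease-related SNPs in the term.
    """
    upper = min(n, M)        # the overlap cannot exceed either group size
    total = comb(N, n)       # number of ways to choose the term's n SNPs
    tail = sum(comb(M, i) * comb(N - M, n - i) for i in range(m, upper + 1))
    return tail / total


# Toy numbers, for illustration only.
p = enrichment_p_value(N=10, M=4, n=3, m=2)
```

Working with exact integers until the final division avoids the rounding problems that arise when individual binomial coefficients are converted to floating point.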

P value adjustment.
In a single analysis, multiple GO terms are tested for significance, which inflates the Type I error rate. A multiple-testing adjustment is therefore needed after the p values are estimated. Seven p value adjustment methods are available through the p.adjust function in R. The Bonferroni correction ("bonferroni") 24 multiplies the p values by the number of comparisons. Less conservative corrections are also included: Holm ("holm") 25, Hochberg ("hochberg") 26, Hommel ("hommel") 27, Benjamini & Hochberg ("BH" or its alias "fdr") 28, and Benjamini & Yekutieli ("BY") 29. There is no gold standard for comparing these methods; the most popular is the False Discovery Rate method. The False Discovery Rate (FDR) is one way of conceptualizing the rate of Type I errors in null hypothesis testing when conducting multiple comparisons. In this study, we used the "fdr" method.
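For readers working outside R, the Benjamini & Hochberg adjustment can be sketched in a few lines of Python. The function name `bh_adjust` is hypothetical, and the code is intended to follow the same step-up procedure as the "BH" (alias "fdr") option of p.adjust; it is an illustration, not the platform's implementation.

```python
def bh_adjust(p_values):
    """Benjamini & Hochberg adjusted p values, returned in the input order."""
    n = len(p_values)
    # Visit the p values from largest to smallest.
    order = sorted(range(n), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0  # adjusted values are capped at 1
    for position, idx in enumerate(order):
        rank = n - position  # 1-based rank in ascending order
        # Scale by n / rank, then enforce monotonicity with a running minimum.
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy p values, for illustration only.
adjusted = bh_adjust([0.01, 0.04, 0.03, 0.005])
```

GO terms whose adjusted value falls below the chosen threshold are then reported as significantly enriched.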

Database
Finally, we constructed a SNP Ontology and SNP functional enrichment analysis platform (http://bioinfo.hrbmu.edu.cn/esnpo/ or http://nclab.hit.edu.cn/esnpo/). It provides two main functions: eQTL-based SNP functional annotation and SNP functional enrichment analysis. After removing redundancy, we obtained 699,445 associations between 21,123 SNPs and 11,714 GO terms. Detailed statistics for the 12 tissues are given in Table 2.

The GO terms fall into three categories: Biological Process (BP), Cellular Component (CC) and Molecular Function (MF).
Case study
PD SNPs data. PD is a degenerative disorder of the central nervous system that mainly affects the motor system. We used the 2,034 unique PD-related SNPs compiled by Guiyou Liu et al. 30. These SNPs came from the following works: 41 SNPs from the GWAS Catalog 3; 70 SNPs from a large PD GWAS with over 3,400 cases and 29,000 controls conducted by Do et al. 31; 783 SNPs from a meta-analysis of PD GWAS with 4,238 PD cases and 4,239 controls performed by Pankratz et al. 32; and 1,292 SNPs from a meta-analysis of PD GWAS using a common set of 7,893,274 variants across 13,708 cases and 95,282 controls conducted by Nalls et al. 33. The p value threshold in these studies was 5.00E-08. After removing redundancy, we selected 2,034 unique SNPs with P < 5.00E-08.
From the results, although eSNPO annotates fewer GO terms than the position-based method, a higher proportion of the results from the eQTL-based method are significant.
We further examined the significant GO terms obtained only by the eQTL-based method (8 BP terms, 8 CC terms and 8 MF terms). Of the 8 BP terms, 2 concern the apoptotic signaling pathway 47, 1 concerns cell proliferation 48,49, 1 concerns cell adhesion 50, and 2 concern JUN phosphorylation 51, all of which have been verified by other studies.

Conclusion
In this work, we constructed an eQTL-based SNP Ontology and SNP functional enrichment analysis platform (http://bioinfo.hrbmu.edu.cn/esnpo/ or http://nclab.hit.edu.cn/esnpo/). We integrated eQTL data with GO, constructed associations between SNPs and GO terms, and then performed functional enrichment analysis. The PD case study indicates that the eQTL-based method is as effective as the position-based method. We therefore believe it is a useful resource for SNP functional enrichment analysis once significant disease-related SNPs have been selected.
However, the method still has some limitations. First, there may not be enough suitable eQTL data available. Second, the scale of eSNPO is far smaller than that of the position-based method. These limitations should ease as more eQTL studies are carried out.