Dear Editor,
Since 2005 [1], genome-wide association studies (GWAS) and Next Generation Sequencing (NGS) have opened up new realms of investigation into the association between different diseases/traits and a large number of genetic loci. To date, more than 1,200 GWAS on over 250 traits have been published [2]. The advent of NGS and affordable whole genome and exome sequencing has accelerated the discovery of the genes underlying Mendelian diseases and has also enhanced our ability to detect rare variants. These variants may explain the missing heritability of common diseases and specific traits [3].
To discriminate the true trait/disease-associated SNPs (TASs) from the large number of loci discovered by GWAS and NGS studies, further downstream statistical and bioinformatics analyses are needed [4]. Variant visualization, functional annotation and prioritization are vital for separating the true associations between genetic markers and diseases/traits from the many signals that represent chance findings. In addition, data visualization faces growing demands for clarity, diversity and interactivity [5,6]. Thorough functional annotations based on genomic location and potential biological effects are needed, especially for markers with moderate effects located in regulatory regions (e.g., non-coding RNAs, enhancers and promoters, and evolutionarily conserved regions) [7,8,9], and for markers that have functional interactions with other TASs [10]. Comprehensive variant annotation will accelerate this process. Existing bioinformatics tools such as ANNOVAR [11] and VAAST [12] use genomic mapping to produce variant annotation, but few tools focus on the potential functional effects of TASs. Importantly, to distinguish true TASs from the large number of GWAS signals, particularly hidden TASs with moderate P-values, an annotation-based prioritization process is required. Therefore, tools that can visualize, annotate and prioritize such data are urgently needed.
We have developed the GWASrap tool (http://jjwanglab.org/gwasrap), which systematically supports genetic variant representation, annotation and prioritization for data generated from GWAS and NGS (Figure 1 and Supplementary information, Data S1). Our web-based framework uses state-of-the-art web technologies to maximize user interaction and visualization of the results (Supplementary information, Figure S3). For a given SNP dataset with its P-values, GWASrap first provides a Circos-style plot to visualize the genetic variants at either the genome or the chromosome level (Supplementary information, Figures S5 and S6). The tool then combines different genomic features (SNP/CNV density, disease susceptibility loci, etc.) with comprehensive annotations that give the researcher an intuitive view of the functional significance of the different genomic regions (Supplementary information, Figure S4). The detailed statistics of the underlying study are also displayed on the web page, including the variant distribution across functional categories, a classic Manhattan plot and a QQ plot (Supplementary information, Figure S8). Users can interact with the Manhattan panel, for example by zooming in and out to search for regions or markers of interest (Supplementary information, Figure S7). The system can also display a comprehensive range of relevant information, from variant genetic attributes to nearby genomic elements such as enhancers or non-coding RNAs. Furthermore, researchers can obtain extensive functional predictions for various features, including transcription factor-binding sites, miRNAs and miRNA target sites, and the predicted changes to them caused by the genetic variants (Supplementary information, Table S1). Our system can re-prioritize genetic variants by combining the original statistical value with the variant prioritization score using a simple additive effect equation (Supplementary information, Figures S1, S2 and Table S2).
Researchers can also re-evaluate the significance of a TAS using the dynamic linkage disequilibrium (LD) panel (Supplementary information, Figure S9) or the tree-like network panel (Supplementary information, Figure S10). GWASrap accepts input variants in different formats: common variants identified by a dbSNP rs ID as well as rare variants from NGS data, which are specified by chromosome and position.
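The two input styles described above can be told apart with a small parser. The following is a minimal sketch, not GWASrap's actual input handling: the function name `parse_variant`, the `chr:pos` text form and the returned dictionary layout are all assumptions made for illustration.

```python
import re

# Hypothetical patterns: a dbSNP identifier such as "rs1042779", or a
# chromosome-and-position pair written as "chr3:12345" or "3:12345".
# The chromosome check is deliberately loose (any 1-2 digit number, X, Y, M, MT).
RS_ID = re.compile(r"^rs(\d+)$", re.IGNORECASE)
CHR_POS = re.compile(r"^(?:chr)?([0-9]{1,2}|X|Y|MT?):(\d+)$", re.IGNORECASE)


def parse_variant(token):
    """Classify one input token as a dbSNP rs ID or a genomic coordinate."""
    token = token.strip()
    match = RS_ID.match(token)
    if match:
        return {"kind": "rsid", "id": "rs" + match.group(1)}
    match = CHR_POS.match(token)
    if match:
        return {
            "kind": "position",
            "chrom": match.group(1).upper(),
            "pos": int(match.group(2)),
        }
    raise ValueError(f"unrecognised variant identifier: {token!r}")
```

A common variant would be passed as `parse_variant("rs1042779")`, and a rare variant without an rs ID as, say, `parse_variant("chr3:12345")` (an invented coordinate).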
We used a Circos-style [13] plot to represent all SNPs in the NHGRI GWAS Catalog (up to February 2012). Compared to the original graph and the traditional Manhattan plot, our plot provides a very broad horizontal area for sanity checking of the current status of GWAS data. For a given GWAS dataset for a specific disease/trait, researchers can easily locate the significant region by looking at the single chromosome plot or by interacting with the dynamic Manhattan panel. The surrounding features and glyphs contain sufficient genetic information to provide an intuitive overview of the GWAS results.
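For readers unfamiliar with the traditional Manhattan plot referred to above, its coordinates are conventionally a cumulative genomic position on the x-axis and -log10(P) on the y-axis. The sketch below shows that standard transformation only; it is not taken from GWASrap, and the function name `manhattan_coordinates` and the toy chromosome lengths in the usage note are assumptions.

```python
import math


def manhattan_coordinates(variants, chrom_lengths):
    """Return (x, y) points for a Manhattan plot.

    variants: iterable of (chrom, pos, p_value) tuples, with p_value > 0.
    chrom_lengths: dict mapping chromosome name to its length in bp, listed
        in plotting order (dicts preserve insertion order in Python 3.7+).
    """
    # Offset of each chromosome = total length of the chromosomes before it.
    offsets = {}
    running = 0
    for chrom, length in chrom_lengths.items():
        offsets[chrom] = running
        running += length

    # x: position along the concatenated genome; y: -log10 of the P-value.
    return [(offsets[chrom] + pos, -math.log10(p)) for chrom, pos, p in variants]
```

For example, `manhattan_coordinates([("2", 5, 1e-8)], {"1": 1000, "2": 500})` places the variant at x = 1005, after the 1000 bp of the toy chromosome 1.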
In the prioritization step, we first estimated the likelihood of disease association for each type of SNP (missense, nonsense, synonymous, etc.) by mapping those data to the HapMap3 dataset. We then computed each variant's genomic mapping score and functional prediction score from its annotation information. The prioritization score was then computed as the product of the above scores (Supplementary information, Data S1). We tested our prioritization method on three datasets and found it to be reliable. We first applied the method to disease-causal SNPs from the OMIM database, and found that the prioritization scores differed significantly between disease-causal SNPs and randomly sampled background SNPs (Supplementary information, Figures S11–S13). Next, we re-scored the top 100 SNPs from the GWAS Catalog, taking into account the synthetic associations of their LD proxies. The resulting scores were closer to those of the 100 benchmark OMIM SNPs than either the original scores or the random background (Supplementary information, Figure S14). Finally, by applying the prioritization method to a bipolar disorder (BPD) GWAS, we successfully selected the variants with the strongest effect that were confirmed in a separate study (Supplementary information, Tables S3 and S4). For example, SNP rs1042779 in the ITIH1 gene, which has many deleterious attributes, obtained a progressively stronger signal after the prioritization step. The variant is highly associated with a non-synonymous variant, rs11177, in the GNL3 gene and a synonymous variant, rs2251219, in the PBRM1 gene, which had a very significant signal in a recent BPD GWAS (Supplementary information, Figure S15). In addition, an intronic variant, rs420259, which was indicated as a high-risk marker in another GWAS, scored considerably higher after applying our prioritization method.
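The scoring steps described above can be illustrated with a toy calculation. This is a sketch under stated assumptions, not the published method: the exact equations are in Supplementary information, Data S1 and are not reproduced here. The sketch assumes that the prioritization score is the product of the three per-variant scores, and that it is combined with the original statistic as -log10(P) plus a weighted prioritization score. The function names, the `weight` parameter, the dictionary keys and all numbers are hypothetical.

```python
import math


def prioritization_score(type_likelihood, mapping_score, functional_score):
    """Assumed form: product of the three per-variant scores."""
    return type_likelihood * mapping_score * functional_score


def combined_score(p_value, prior_score, weight=1.0):
    """Assumed additive form: -log10(P) plus a weighted prioritization score."""
    return -math.log10(p_value) + weight * prior_score


def rerank(variants, weight=1.0):
    """Return variant IDs sorted by combined score, highest first."""
    scored = []
    for v in variants:
        prior = prioritization_score(
            v["type_likelihood"], v["mapping"], v["functional"]
        )
        scored.append((combined_score(v["p"], prior, weight), v["id"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [variant_id for _, variant_id in scored]


# Invented example: snpB has the weaker P-value but the stronger annotation.
toy = [
    {"id": "snpA", "p": 1e-6, "type_likelihood": 1.0, "mapping": 1.0, "functional": 0.5},
    {"id": "snpB", "p": 1e-5, "type_likelihood": 1.0, "mapping": 2.0, "functional": 1.0},
]
ranking = rerank(toy)
```

With these invented numbers, the annotation evidence lifts `snpB` above `snpA` even though its P-value is weaker, which is the kind of re-ordering the text describes for variants with moderate P-values. Setting `weight=0` recovers the ranking by P-value alone.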
We have built a local database, with a user-friendly web interface and web services, to provide a rapid diagnostic tool for genetic variants. Our system accepts a query as either a dbSNP ID or a chromosomal location, and quickly returns the annotation information displayed on an interactive LD panel (Supplementary information, Figure S9). For a rare or de novo variant without a dbSNP ID, the user can supply genomic coordinates to obtain the annotation. Our database also hosts well-structured and up-to-date repositories of all significant TASs found in diseases frequently investigated by GWAS. For each specific disease/trait, we have integrated the significant SNPs annotated in the GWASdb database [14].
In summary, our system offers a universal web portal for GWAS/NGS representation, annotation and prioritization. The system will help users visualize their data and will facilitate the functional annotation of genetic variants discovered by GWAS and NGS studies.
References
1. Klein RJ, Zeiss C, Chew EY, et al. Complement factor H polymorphism in age-related macular degeneration. Science 2005; 308:385–389.
2. Hindorff LA, Sethupathy P, Junkins HA, et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci USA 2009; 106:9362–9367.
3. Bamshad MJ, Ng SB, Bigham AW, et al. Exome sequencing as a tool for Mendelian disease gene discovery. Nat Rev Genet 2011; 12:745–755.
4. Wang K, Li M, Hakonarson H. Analysing biological pathways in genome-wide association studies. Nat Rev Genet 2010; 11:843–854.
5. Robinson JT, Thorvaldsdottir H, Winckler W, et al. Integrative genomics viewer. Nat Biotechnol 2011; 29:24–26.
6. Pruim RJ, Welch RP, Sanna S, et al. LocusZoom: regional visualization of genome-wide association scan results. Bioinformatics 2010; 26:2336–2337.
7. Wang J. A database of genetic variants in microRNA genes and their putative functional roles in gene regulation. Hum Mutat 2012; 33:vii–vii.
8. Wang W, Wei Z, Lam TW, Wang JW. Next generation sequencing has lower sequence coverage and poorer SNP-detection capability in the regulatory regions. Sci Rep 2011; 1:55.
9. Wei Z, Jensen ST. GAME: detecting cis-regulatory elements using a genetic algorithm. Bioinformatics 2006; 22:1577–1584.
10. Hu X, Liu Q, Zhang Z, et al. SHEsisEpi, a GPU-enhanced genome-wide SNP-SNP interaction scanning algorithm, efficiently reveals the risk genetic epistasis in bipolar disorder. Cell Res 2010; 20:854–857.
11. Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res 2010; 38:e164.
12. Yandell M, Huff C, Hu H, et al. A probabilistic disease-gene finder for personal genomes. Genome Res 2011; 21:1529–1542.
13. Krzywinski M, Schein J, Birol I, et al. Circos: an information aesthetic for comparative genomics. Genome Res 2009; 19:1639–1645.
14. Li MJ, Wang P, Liu X, et al. GWASdb: a database for human genetic variants identified by genome-wide association studies. Nucleic Acids Res 2012; 40 (Database issue):D1047–D1054.
Acknowledgements
This study was supported by grants from the Research Grants Council (781511M, 778609M, N_HKU752/10, AoE M-04/04), Food and Health Bureau (10091262) of Hong Kong, and The University of Hong Kong Strategic Research Theme on Genomics.
Additional information
(Supplementary information is linked to the online version of the paper on the Cell Research website.)
Supplementary information
Supplementary information, Data S1
Methods (PDF 773 kb)
Supplementary information, Figure S1
The percentage of TASs in GWASdb and in HapMap3 for different types of genetic variants. (PDF 127 kb)
Supplementary information, Figure S2
The overview of prioritization pipeline (See Supplementary Information, Data S1 for details). (PDF 203 kb)
Supplementary information, Figure S3
Different styles of GWAS representations: (a) The plotting style used in GWAS catalog. (PDF 279 kb)
Supplementary information, Figure S4
Major components of GWAS representation. (PDF 269 kb)
Supplementary information, Figure S5
The GWAS representation for Diabetes Mellitus. (PDF 215 kb)
Supplementary information, Figure S6
The GWAS representation of chromosome 1 for Diabetes Mellitus. (PDF 168 kb)
Supplementary information, Figure S7
The major components of the GWAS annotation: (a) The interactive Manhattan plot can be zoomed in and out and can be queried. (PDF 200 kb)
Supplementary information, Figure S8
Statistics for a given GWAS: (a) The distribution of variants in different parts of the genes. (PDF 288 kb)
Supplementary information, Figure S9
An interactive HapMap LD panel for the target variant. (PDF 110 kb)
Supplementary information, Figure S10
A prioritization tree depicts the deleteriousness of the selected variants. (PDF 160 kb)
Supplementary information, Figure S11
The distribution of prioritization scores for OMIM disease-causal SNPs and randomly selected SNPs. (PDF 145 kb)
Supplementary information, Figure S12
The distribution of prioritization scores for the intergenic/intronic OMIM disease-causal SNPs and that of randomly selected SNPs. (PDF 121 kb)
Supplementary information, Figure S13
The distribution of prioritization scores for the exonic OMIM disease-causal SNPs and that of randomly selected SNPs. (PDF 125 kb)
Supplementary information, Figure S14
The boxplots of prioritization scores for GWAS top 100 TASs, with and without synthetic associations, OMIM random SNPs and randomly selected SNPs. (PDF 152 kb)
Supplementary information, Figure S15
Investigation of rs1042779 from a bipolar disorder GWAS. (PDF 204 kb)
Supplementary information, Table S1
Resources used for genomic mapping and functional effect prediction in GWASrap. (PDF 148 kb)
Supplementary information, Table S2
Comparison of prioritization score statistics for different datasets and their SNP deleteriousness types. (PDF 94 kb)
Supplementary information, Table S3
Information for the top 10 prioritized SNPs from the bipolar disorder GWAS. (PDF 138 kb)
Supplementary information, Table S4
Information for the top 10 prioritized SNPs from the bipolar disorder GWAS with LD proxy. (PDF 182 kb)
Cite this article
Li, M., Sham, P. & Wang, J. Genetic variant representation, annotation and prioritization in the post-GWAS era. Cell Res 22, 1505–1508 (2012). https://doi.org/10.1038/cr.2012.106