Introduction

The completion of the human genome and HapMap projects, combined with advances in high-throughput genotyping techniques, has resulted in an explosion of genome-wide association studies (GWAS).1 These studies interrogate hundreds of thousands to a few million genetic variants and have identified a large number of loci associated with phenotypic traits or disease outcomes. As a result of their early and continued success, the number of published GWAS has increased steadily each year, from just two in 2005 to 238 in 2010 (as of 8 December; data from the statistics page in GWAS Integrator). To help the research community find these publications and further explore the reported associations, the National Human Genome Research Institute (NHGRI) has established and maintains the NHGRI GWAS Catalog (http://www.genome.gov/26525384), an online, regularly updated database of single nucleotide polymorphism (SNP)-trait associations from GWAS.2 We have developed the GWAS Integrator, a bioinformatics tool that offers a robust search capacity and a set of data mining functions by integrating information from the NHGRI GWAS Catalog with data from other established bioinformatics resources, including HapMap (http://hapmap.ncbi.nlm.nih.gov/), the Human Genome Epidemiology (HuGE) Navigator (http://www.hugenavigator.net/), SNP Annotation and Proxy Search (SNAP) (http://www.broadinstitute.org/mpg/snap/ldsearch.php) and the University of California Santa Cruz (UCSC) Genome Browser (http://genome.ucsc.edu/cgi-bin/hgGateway).

Implementation

The GWAS Integrator was built on J2EE technology (http://java.sun.com/javaee/) and on other Java open-source frameworks, including Hibernate (http://www.hibernate.org/), Struts (http://struts.apache.org/), and jCharts (http://jcharts.krysalis.org/). The database is populated and updated with SNP-trait associations from the NHGRI GWAS Catalog each week when new associations are available; details about the selection criteria for these associations are available on the NHGRI GWAS Catalog website. Chromosomal locations of the associated SNPs and relevant proxy SNPs are downloaded from SNAP and UCSC and added to the database as needed (NCBI Build 36/UCSC Version 18 (hg18)). Records from the NCBI Entrez Gene database (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?DB=gene) are used as standards for gene information, including chromosomal location. As a component of the HuGE Navigator,3 the GWAS Integrator can take advantage of the established informatics infrastructure used in this integrated knowledge base on human genome epidemiology. The HuGE literature database includes PubMed abstracts indexed with MeSH terminology (http://www.ncbi.nlm.nih.gov/sites/entrez?db=mesh). The MeSH tree hierarchies and the Unified Medical Language System Metathesaurus are used to map synonyms of phenotype/disease terms to a standard code, which enhances the search capacity of the GWAS Integrator. In addition, genes indexed in the HuGE literature database can be used to identify relevant candidate genes. The detailed schema for the HuGE Navigator database can be found in the paper by Yu et al.4

Features

Robust search capacity

Users can perform free-text searches of data extracted from published GWAS. Searchable terms include the disease/trait, gene name/gene symbol/gene alias, rs number, first author name, journal, chromosome region, platform, PubMed ID, and any text in the publication title or abstract. Search results can be filtered by variant, gene, region, trait, publication, author, journal, and year, as well as by ‘hit’ (ie, the SNP-trait association identified in a GWAS). Filters can be applied successively to narrow a result set. The filtering function can also be used to obtain a quick snapshot of GWAS published in a particular research field. For example, a user can easily get descriptive statistics for GWAS on breast cancer, including the number of variants that have been studied and the number of GWAS publications (Figure 1a).
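The successive filtering and snapshot statistics described above can be illustrated with a minimal sketch. It is not the GWAS Integrator implementation (which is Java-based); the `Hit` record, the `filter_hits` and `summarize` helpers, and all identifiers in the sample data are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One SNP-trait association (hypothetical, simplified record)."""
    rs_id: str
    gene: str
    trait: str
    year: int
    pmid: str


def filter_hits(hits, **criteria):
    """Keep hits whose fields equal every given criterion."""
    return [h for h in hits
            if all(getattr(h, field) == value for field, value in criteria.items())]


def summarize(hits):
    """Descriptive counts for a set of hits."""
    return {
        "hits": len(hits),
        "variants": len({h.rs_id for h in hits}),
        "genes": len({h.gene for h in hits}),
        "publications": len({h.pmid for h in hits}),
    }


# Placeholder data; none of these identifiers are real catalog entries.
HITS = [
    Hit("rsA", "GENE1", "breast cancer", 2007, "P1"),
    Hit("rsB", "GENE2", "breast cancer", 2009, "P2"),
    Hit("rsA", "GENE1", "breast cancer", 2010, "P3"),
    Hit("rsC", "GENE3", "height", 2009, "P4"),
]

breast_cancer = filter_hits(HITS, trait="breast cancer")   # first filter
breast_cancer_2009 = filter_hits(breast_cancer, year=2009)  # second filter
snapshot = summarize(breast_cancer)
```

Because each filter returns a plain list of hits, any filter can be applied to the output of another, which mirrors the repeated filtering available in the interface.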

Figure 1

Illustrative screenshots of the GWAS Integrator. (a) Display of the GWAS hits related to breast cancer. (b) Display of SNP proxies of the variants related to breast cancer. (c) Display of dynamically generated SNP and candidate gene UCSC custom tracks related to breast cancer. (d) Display of GWAS hits from proxy SNPs of GWAS hits related to breast cancer. (e) Display of all genes that fall into the region around the selected GWAS hits related to breast cancer.

Data mining capacity

A series of data mining functions can be used to further explore search results.

Variant->proxy function

This function provides information on SNP proxies related to the variants (SNPs) of the selected GWAS hits. Users can define configuration parameters for proxy SNP retrieval, such as the HapMap release version, HapMap population, and r² cutoff (Figure 1b).
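As a rough illustration of how a population and an r² cutoff narrow the set of proxies, consider the sketch below. It assumes linkage disequilibrium values are already available as tuples; the `select_proxies` helper, the default cutoff of 0.8, and the SNP identifiers are assumptions made for this example, not details of SNAP or the GWAS Integrator.

```python
def select_proxies(ld_pairs, query_snp, population, r2_cutoff=0.8):
    """Return proxies of query_snp in one population with r2 >= r2_cutoff.

    ld_pairs: iterable of (query, proxy, population, r2) tuples.
    The default cutoff of 0.8 is an assumption for this sketch.
    """
    return sorted(
        proxy
        for query, proxy, pop, r2 in ld_pairs
        if query == query_snp and pop == population and r2 >= r2_cutoff
    )


# Placeholder LD values; CEU and YRI are HapMap population panels,
# but the SNP identifiers and r2 values are invented.
LD_PAIRS = [
    ("rsA", "rsP1", "CEU", 0.95),
    ("rsA", "rsP2", "CEU", 0.60),
    ("rsA", "rsP3", "YRI", 0.90),
    ("rsB", "rsP4", "CEU", 1.00),
]

strict = select_proxies(LD_PAIRS, "rsA", "CEU", r2_cutoff=0.8)
relaxed = select_proxies(LD_PAIRS, "rsA", "CEU", r2_cutoff=0.5)
```

Lowering the cutoff admits more weakly correlated proxies, and changing the population can change the proxy set entirely.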

Variant->UCSC function

This function dynamically creates an SNP custom track to display selected GWAS hits in the UCSC Genome Browser. Users can select the SNP to center the display in the UCSC Genome Browser using a dropdown menu, which lists all the rs numbers for the selected GWAS hits. The ‘Window Size’ field defines the display range around the centered SNP in the UCSC Genome Browser; for example, when 500 kb is specified in the Window Size field, the UCSC Genome Browser will display 250 kb on each side of the centered SNP. Users can also include proxy SNPs in the SNP custom track, or create a separate custom track for genes indexed in the HuGE literature database related to the query (Figure 1c).
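The window arithmetic and the general shape of a custom track can be sketched as follows. The `browser position` and `track` lines and the 0-based, half-open BED coordinates follow the UCSC custom track format; the helper names, the track name and description, and the SNP coordinates are hypothetical, and the sketch is not the code used by the GWAS Integrator.

```python
def window_around(position, window_size):
    """Split window_size evenly on both sides of a 1-based position."""
    half = window_size // 2
    start = max(1, position - half)  # do not run off the chromosome start
    return start, position + half


def custom_track(snps, center_rs, window_size, track_name="GWAS_hits"):
    """Build custom-track text for SNPs given as {rs_id: (chrom, position)}."""
    chrom, pos = snps[center_rs]
    start, end = window_around(pos, window_size)
    lines = [
        f"browser position {chrom}:{start}-{end}",
        # visibility=3 corresponds to the "pack" display mode
        f'track name="{track_name}" description="Selected GWAS hits" visibility=3',
    ]
    for rs_id, (snp_chrom, snp_pos) in snps.items():
        # BED uses 0-based start and exclusive end for a 1-based SNP position
        lines.append(f"{snp_chrom}\t{snp_pos - 1}\t{snp_pos}\t{rs_id}")
    return "\n".join(lines)


# Placeholder SNPs; coordinates are invented for illustration.
SNPS = {"rsA": ("chr1", 1_000_000), "rsB": ("chr1", 1_100_000)}
track_text = custom_track(SNPS, "rsA", 500_000)
```

With a window size of 500 kb centered at position 1,000,000, the browser line requests chr1:750000-1250000, that is, 250 kb on each side of the centered SNP.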

Variant->GWAS function

This function uses proxy SNPs to identify additional GWAS hits that may be related to the user-selected GWAS hits. Users can define configuration parameters for proxy SNP retrieval, such as HapMap release version, HapMap population, and r² cutoff (Figure 1d).
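The idea of reaching additional hits through proxies amounts to a two-step lookup: from each selected SNP to its proxies, then from each proxy to any catalog associations. The sketch below assumes both mappings are already in memory; `hits_via_proxies` and all identifiers are hypothetical.

```python
def hits_via_proxies(selected, proxies, catalog):
    """Find catalog traits reported for proxies of the selected SNPs.

    selected: set of rs IDs chosen by the user.
    proxies:  {rs_id: [proxy rs IDs]}, already filtered by r2 and population.
    catalog:  {rs_id: [traits reported for that SNP]}.
    """
    found = {}
    for rs_id in selected:
        for proxy in proxies.get(rs_id, []):
            if proxy in selected:
                continue  # already among the selected hits, not an additional one
            for trait in catalog.get(proxy, []):
                found.setdefault(proxy, set()).add(trait)
    return {proxy: sorted(traits) for proxy, traits in found.items()}


# Placeholder mappings; identifiers and traits are invented.
PROXIES = {"rsA": ["rsP1", "rsP2"], "rsB": ["rsP2", "rsA"]}
CATALOG = {"rsA": ["trait X"], "rsP2": ["trait Y", "trait Z"]}

additional = hits_via_proxies({"rsA", "rsB"}, PROXIES, CATALOG)
```

In this example, rsP1 is a proxy without any reported association and rsA is itself selected, so only rsP2 is returned.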

Variant->gene function

This function lists all genes that fall into the region around the selected GWAS hits. Users can define the genomic distance around the hits. Genes that are also indexed in the HuGE literature database and reported with the query term are highlighted with a hyperlink to the corresponding Genopedia5 record in HuGE Navigator (Figure 1e).
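Finding genes within a user-defined distance of a hit is an interval overlap test. The sketch below assumes gene coordinates are available as (chromosome, start, end) on the same genome build as the SNP positions; `genes_near` and the gene records are hypothetical.

```python
def genes_near(chrom, position, genes, distance):
    """Return names of genes overlapping [position - distance, position + distance].

    genes: {name: (chrom, start, end)} on the same genome build as position.
    """
    low, high = position - distance, position + distance
    return sorted(
        name
        for name, (gene_chrom, start, end) in genes.items()
        # two intervals overlap when each starts before the other ends
        if gene_chrom == chrom and start <= high and end >= low
    )


# Placeholder gene coordinates, invented for illustration.
GENES = {
    "GENE1": ("chr1", 900_000, 950_000),
    "GENE2": ("chr1", 1_200_000, 1_300_000),
    "GENE3": ("chr2", 990_000, 1_010_000),
}

within_100kb = genes_near("chr1", 1_000_000, GENES, 100_000)
within_250kb = genes_near("chr1", 1_000_000, GENES, 250_000)
```

A gene is reported when any part of it falls inside the window, so a gene that only partially overlaps the region is still listed.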

Proxy reference search

Users can also search for variant-trait associations using proxy SNPs. For example, searching with ‘rs663129’ will lead to six proxy SNPs that have GWAS hits.

Real-time tracking

The statistics page presents an overview of published GWAS, including total numbers of publications, hits, reported genes, genic SNPs, intergenic SNPs, variants, and disease/traits. Temporal trends are displayed graphically for each item. Top 10 lists are generated and displayed in web tables for variant, gene, chromosome region, disease/trait, first author, and journal. As of 10 February 2011, the database contains 4817 GWAS hits, representing 475 disease/traits and 3920 variants from 796 publications.
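A top 10 list of this kind is a frequency count over one field of the hit records. The following sketch uses Python's `collections.Counter`; the `top_n` helper and the sample records are hypothetical and stand in for whatever aggregation the database performs.

```python
from collections import Counter


def top_n(hits, field, n=10):
    """Return the n most frequent values of one field as (value, count) pairs."""
    return Counter(hit[field] for hit in hits).most_common(n)


# Placeholder records, invented for illustration.
RECORDS = [
    {"gene": "GENE1", "journal": "J1"},
    {"gene": "GENE2", "journal": "J1"},
    {"gene": "GENE1", "journal": "J2"},
]

top_genes = top_n(RECORDS, "gene")
top_journal = top_n(RECORDS, "journal", n=1)
```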

Conclusion

GWAS of phenotypic traits and diseases have successfully identified a large number of genetic loci for further investigation by replication, meta-analysis, imputation of untyped loci, resequencing, identification of functional polymorphisms, and analysis of gene–gene and gene–environment interactions.6 By integrating relevant information from multiple data sources, the GWAS Integrator helps researchers to quickly identify GWAS of interest, examine the findings in the context of other genetic and epidemiologic research, perform online data mining, and make inferences that can inform future studies. Including the GWAS Integrator in the HuGE Navigator allows it to take advantage of the established informatics infrastructure of the HuGE literature database, one of the most comprehensive repositories of published genetic associations. The dynamic generation of a custom track is an efficient way to access all features offered by the UCSC Genome Browser. Ongoing collaboration between CDC and NHGRI in collecting and synchronizing GWAS data will help keep the GWAS data sources up to date. As a new application in HuGE Navigator, the GWAS Integrator was built to interconnect with other applications already in the system, such as Genopedia and Gene Prospector, so that users can navigate directly to related information. Although the GWAS Integrator database content is currently limited to the NHGRI GWAS Catalog, we plan to implement a feature that allows users to import their own GWAS data for data mining.