To the editor:
The increasing availability of polymorphism data has allowed more gene association studies to be carried out, and the number of published genetic association studies is growing rapidly. Studies following up successful linkage findings over the last decade have also fueled this increase. Although there are single-nucleotide polymorphism and human variation databases (refs. 1,2), there is currently no public repository for genetic association data. It is therefore difficult to query association data in a systematic manner or to integrate them with other molecular databases. OMIM (ref. 3), the main repository of genetic information for mendelian disorders, is largely text based and follows a historical narrative design, making it difficult to compare large sets of molecular data. Moreover, OMIM archives mature, high-quality data of high significance, the standard in rare mendelian disorders. Although these data are useful, OMIM does not routinely collect findings of lower significance or negative findings. The study of nonmendelian, common complex disorders is often a struggle to establish disease relevance from lower significance values and often conflicting evidence. Negative data are often not reported or are marginalized into obscure and less accessible scientific journals, resulting in a publication bias favoring positive genetic associations (ref. 4). Here, we describe the development of a genetic association database (GAD; http://geneticassociationdb.nih.gov) that aims to collect, standardize and archive genetic association study data and to make them easily accessible to the scientific community.
There are no standards for designing, implementing, interpreting or reporting association studies (e.g., sample size, replication, significant P values), although guidelines have been suggested (refs. 4,5,6,7). The literature is filled with alternative, idiosyncratic and arbitrary gene names and gene symbols, as well as a continuum of phenotypic descriptions. Studies using arbitrary nomenclature continue to be published, making cross-comparison and meta-analysis difficult. One goal of GAD is to standardize molecular nomenclature in the archival process by including official HUGO gene symbols. After this assignment, each record is annotated with links to molecular databases (LocusLink, GeneCards, HapMap, etc.) and reference databases (PubMed, CDC), among others. Once the records are standardized, association data can be integrated systematically with other molecular databases, data mining tools, annotation and future sources of molecular data (e.g., gene interactions, quantitative trait loci). Moreover, cross-comparison and meta-analysis of studies become more efficient.
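The symbol standardization step described above can be sketched as a simple lookup. This is only an illustration in Python: the function name, the tiny alias table and the set of official symbols are assumptions made for the example and do not reflect how GAD itself performs the assignment.

```python
# Illustrative only: a real curation pipeline would load the full, current
# list of official symbols and aliases from the nomenclature authority.
OFFICIAL_SYMBOLS = {"IL6", "ACE"}

# Small hypothetical alias table (keys stored in upper case).
ALIAS_TO_OFFICIAL = {
    "IL-6": "IL6",
    "IFNB2": "IL6",
    "BSF2": "IL6",
    "DCP1": "ACE",
}


def standardize_symbol(name, official_symbols, alias_map):
    """Return the official symbol for a reported gene name, or None.

    Upper-casing is a simplification; some official symbols contain
    lower-case letters and would need case-aware handling.
    """
    key = name.strip().upper()
    if key in official_symbols:
        return key
    # None signals that the name needs manual curation.
    return alias_map.get(key)


reported = ["il-6", " ACE ", "DCP1", "unknown-gene"]
standardized = [
    standardize_symbol(n, OFFICIAL_SYMBOLS, ALIAS_TO_OFFICIAL) for n in reported
]
```

Returning `None` for unrecognized names, rather than guessing, keeps ambiguous symbols visible for manual review.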
There are three main components of GAD: a web interface, Perl modules and the database, which uses the Oracle RDBMS. The database has three layers: gene and disease data are organized into a large fact table in the middle layer, with dimensional views in the top layer. The bottom layer contains the tools for adding, editing, batch loading and downloading data to and from the database.
We identified data fields common to genetic association studies, such as disease phenotypes, sample sizes, significance values, population information and allele descriptions. These fields are grouped into five views: disease phenotypes (Disease View), gene-based molecular data (Gene View), chromosomal and mutation information (CH-SNP-Hap View), references (Reference View) and all fields combined (All View). Table 1 shows a summary of the current contents of the database.
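As a rough illustration of the fact-table-plus-views layout described above, the following sketch uses Python's built-in sqlite3 module as a stand-in for Oracle. All table, column and view names, and the single placeholder row, are assumptions made for the example; they are not GAD's actual schema or data.

```python
import sqlite3

# In-memory database standing in for the Oracle RDBMS named in the text.
conn = sqlite3.connect(":memory:")

# Hypothetical middle layer: one wide fact table, one row per association record.
# Hypothetical top layer: views that each expose a subset of the columns.
conn.executescript("""
CREATE TABLE association_fact (
    id            INTEGER PRIMARY KEY,
    gene_symbol   TEXT NOT NULL,
    disease_class TEXT,
    disease       TEXT,
    chromosome    TEXT,
    polymorphism  TEXT,
    sample_size   INTEGER,
    p_value       REAL,
    population    TEXT,
    reference_id  TEXT
);

CREATE VIEW disease_view AS
    SELECT id, disease_class, disease, gene_symbol, p_value
    FROM association_fact;

CREATE VIEW gene_view AS
    SELECT id, gene_symbol, chromosome, polymorphism
    FROM association_fact;

CREATE VIEW reference_view AS
    SELECT id, reference_id, sample_size, population
    FROM association_fact;
""")

# Placeholder record; the values are invented and carry no scientific meaning.
conn.execute(
    "INSERT INTO association_fact "
    "(gene_symbol, disease_class, disease, chromosome, polymorphism, "
    " sample_size, p_value, population, reference_id) "
    "VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    ("GENE1", "example class", "example disease", "1", "example variant",
     100, 0.05, "example population", "REF-PLACEHOLDER"),
)

disease_rows = conn.execute(
    "SELECT gene_symbol, disease FROM disease_view"
).fetchall()
gene_rows = conn.execute(
    "SELECT gene_symbol, chromosome FROM gene_view"
).fetchall()
```

Because every view reads from the same fact table, a record entered once is visible from each perspective without duplication.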
Query tools include keyword search functions that permit field-specific searches, advanced combinatorial queries and pull-down selections of controlled vocabularies (Fig. 1). Batch searches are done against an aggregate table, allowing the user to input a list of genes (300) at once. In this way, batch results from high-throughput assays, such as microarrays, proteomics, cDNA sequencing and SAGE (serial analysis of gene expression), can be rapidly queried in the context of human disease associations.
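A batch search of this kind might look like the sketch below, again using sqlite3 as a stand-in. The table name, the function name and the reading of "(300)" as an upper limit on list size are all assumptions, and the rows are placeholders rather than real associations.

```python
import sqlite3

MAX_BATCH = 300  # assumed cap, based on the "(300)" mentioned in the text


def batch_lookup(conn, symbols):
    """Return (gene_symbol, disease) rows for every symbol in the list."""
    # Trim whitespace, drop empty entries and remove duplicates.
    cleaned = sorted({s.strip() for s in symbols if s.strip()})
    if len(cleaned) > MAX_BATCH:
        raise ValueError(f"at most {MAX_BATCH} genes per batch")
    if not cleaned:
        return []
    # One bound parameter per symbol keeps the query safe from injection.
    placeholders = ",".join("?" * len(cleaned))
    sql = (
        "SELECT gene_symbol, disease FROM gene_disease "
        f"WHERE gene_symbol IN ({placeholders}) "
        "ORDER BY gene_symbol, disease"
    )
    return conn.execute(sql, cleaned).fetchall()


# Hypothetical aggregate table with placeholder rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene_disease (gene_symbol TEXT, disease TEXT)")
conn.executemany(
    "INSERT INTO gene_disease VALUES (?, ?)",
    [("GENE1", "disease A"), ("GENE2", "disease B"), ("GENE1", "disease C")],
)

# For example, a gene list exported from a microarray experiment.
hits = batch_lookup(conn, ["GENE1", "GENE3", " GENE1 "])
```

Genes with no recorded association, such as the placeholder `GENE3` here, simply return no rows.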
Of particular interest are phenotypic descriptions captured at multiple levels. A top level 'disease class' is assigned, followed by 'disease' from the original paper. If studies recognize clinical subphenotypes, endophenotypes or intermediate phenotypes, these are noted in 'narrow phenotype'. Moreover, certain alleles have defined molecular characteristics, which are noted under 'molecular phenotype'. These molecular and pathway variants, such as altered transcription due to a promoter polymorphism (IL6) or serum levels of ACE, may have a closer relationship to a polymorphism than to the end-stage complex phenotype. Using this hierarchical phenotypic assignment makes it easier to consider molecular phenotypes in the context of end-stage disease. In some cases, although independent end-stage diseases may not share overt similarities at a clinical level, the genetic factors that contribute to those diseases may be shared at a molecular level (refs. 8,9). The development of a hierarchy of phenotypes, from broad to specific, may allow classification of diseases, subphenotypes and molecular parameters of disease and their relationship to complex traits.
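One way to picture the four-level phenotype assignment is as a small record type. The sketch below is a hypothetical Python representation, not GAD's data model; the class, field and function names are assumptions, and the gene and disease values are placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Phenotype:
    """Phenotype levels from broad to specific; the last two are optional."""
    disease_class: str
    disease: str
    narrow_phenotype: Optional[str] = None
    molecular_phenotype: Optional[str] = None

    def levels(self):
        """Return the levels that are filled in, broadest first."""
        all_levels = (
            self.disease_class,
            self.disease,
            self.narrow_phenotype,
            self.molecular_phenotype,
        )
        return [level for level in all_levels if level is not None]


def diseases_by_molecular_phenotype(records):
    """Group diseases that share a gene and a molecular phenotype."""
    groups = defaultdict(set)
    for gene, phenotype in records:
        if phenotype.molecular_phenotype:
            key = (gene, phenotype.molecular_phenotype)
            groups[key].add(phenotype.disease)
    return dict(groups)


# Placeholder records: two clinically distinct diseases sharing one
# molecular phenotype, and one record without a molecular phenotype.
records = [
    ("GENE1", Phenotype("class X", "disease A",
                        molecular_phenotype="altered transcription")),
    ("GENE1", Phenotype("class Y", "disease B",
                        molecular_phenotype="altered transcription")),
    ("GENE2", Phenotype("class X", "disease A", narrow_phenotype="subtype 1")),
]

shared = diseases_by_molecular_phenotype(records)
```

Grouping on the most specific level surfaces diseases from different classes that may share a molecular basis, which is the kind of comparison the hierarchy is meant to support.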
GAD is a comprehensive, public, web-based archive of published genetic association studies that currently holds molecular, clinical and study parameters for >5,000 human genetic association studies. This approach will allow the systematic analysis of common complex human genetic disease in the context of modern high-throughput assay systems and current annotated molecular nomenclature.