Correspondence

The Genetic Association Database

Nature Genetics volume 36, pages 431–432 (2004)

To the editor:

The increasing availability of polymorphism data has allowed more gene association studies to be carried out and the number of published genetic association studies is growing rapidly. Studies done secondarily to successful linkage studies over the last decade have also fueled the increase in published association studies. Although there are single-nucleotide polymorphism and human variation databases1,2, there is currently no public repository for genetic association data. It is difficult to query association data in a systematic manner or to integrate association data with other molecular databases. OMIM3, the main repository of genetic information for mendelian disorders, is largely text based and is of a historical narrative design, making it difficult to compare large sets of molecular data. Moreover, OMIM archives mature, high-quality data of high significance, the standard in rare mendelian disorders. Although these data are useful, OMIM does not routinely collect findings of lower significance or negative findings. The study of nonmendelian, common complex disorders is often a struggle to find disease relevance with lower significance values, and often conflicting evidence. Negative data are often not reported or are marginalized into obscure and less accessible scientific journals, resulting in a publication bias favoring positive genetic associations4. Here, we describe the development of a genetic association database (GAD; http://geneticassociationdb.nih.gov) that aims to collect, standardize and archive genetic association study data and to make them easily accessible to the scientific community.

There are no standards for designing, implementing, interpreting or reporting association studies (e.g., sample size, replication, significant P values), although guidelines have been suggested4,5,6,7. The literature is filled with alternative, idiosyncratic and arbitrary gene names and gene symbols, as well as a continuum of phenotypic descriptions. Studies using arbitrary nomenclature continue to be published, making cross-comparison and meta-analysis difficult. One goal of GAD is to standardize molecular nomenclature in the archival process by including official HUGO gene symbols. After this assignment, each record is annotated with links to molecular databases (LocusLink, GeneCards, HapMap, etc.) and reference databases (PubMed, CDC), among others. Once records are standardized, association data can be integrated systematically with other molecular databases, data mining tools, annotation and future sources of molecular data (e.g., gene interactions, quantitative trait loci). Moreover, cross-comparison and meta-analysis of studies become more efficient.
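As an illustration only, the standardization step might be sketched as a lookup from reported gene names to official symbols. The alias table, the set of official symbols and the function name below are assumptions made for this example; they are not taken from GAD's curation process or its Perl modules.

```python
from typing import Optional

# Illustrative entries only; a real curation pipeline would draw on the
# full official nomenclature rather than a hand-written table.
OFFICIAL_SYMBOLS = {"TNF", "IL6", "ACE"}
ALIAS_TO_OFFICIAL = {
    "TNFA": "TNF",
    "TNF-ALPHA": "TNF",
    "IL-6": "IL6",
}


def normalize_symbol(reported: str) -> Optional[str]:
    """Return the official symbol for a reported gene name, or None if unknown."""
    key = reported.strip().upper()
    if key in OFFICIAL_SYMBOLS:
        return key
    return ALIAS_TO_OFFICIAL.get(key)


reported_names = [" tnfa ", "IL-6", "ACE", "unknown-gene"]
normalized = [normalize_symbol(name) for name in reported_names]
```

Names that cannot be resolved return `None` rather than a guess, so that they can be set aside for manual review instead of being archived under a wrong symbol.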

There are three main components of GAD: a web interface, Perl modules and the database, which uses the Oracle RDBMS. The database has three layers; gene and disease data are organized into a large fact table in a middle layer with dimensional views on the top layer. The bottom layer contains the tools for adding, editing, batch loading and downloading data to and from the database.

We identify data fields common to genetic association studies, such as disease phenotypes, sample sizes, significance values, population information and allele descriptions. These fields are grouped into five views relevant to disease phenotypes (Disease View), gene-based molecular data (Gene View), chromosomal and mutation information (CH-SNP-Hap View), Reference View and All View. Table 1 shows a summary of the current contents in the database.
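The arrangement of a single fact table with views layered on top can be sketched as follows. This is a minimal illustration that uses SQLite from Python as a stand-in for the Oracle RDBMS; the table name, column names and the two views shown are assumptions, and the rows are placeholders rather than GAD records.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Middle layer: one wide fact table holding gene and disease data.
    CREATE TABLE association_fact (
        gene_symbol   TEXT,
        disease       TEXT,
        disease_class TEXT,
        chromosome    TEXT,
        p_value       REAL,
        sample_size   INTEGER,
        population    TEXT,
        allele        TEXT,
        pubmed_id     INTEGER
    );
    -- Top layer: views exposing subsets of the fields.
    CREATE VIEW disease_view AS
        SELECT disease_class, disease, gene_symbol, p_value, pubmed_id
        FROM association_fact;
    CREATE VIEW gene_view AS
        SELECT gene_symbol, chromosome, allele, disease, pubmed_id
        FROM association_fact;
    """
)

# Placeholder rows, not real study results.
rows = [
    ("GENE1", "disease A", "class X", "1", 0.01, 200, "population P", "allele a", 1),
    ("GENE2", "disease A", "class X", "2", 0.20, 150, "population Q", "allele b", 2),
]
conn.executemany(
    "INSERT INTO association_fact VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", rows
)

disease_rows = conn.execute(
    "SELECT gene_symbol, p_value FROM disease_view "
    "WHERE disease = ? ORDER BY gene_symbol",
    ("disease A",),
).fetchall()
gene_rows = conn.execute(
    "SELECT chromosome, allele FROM gene_view WHERE gene_symbol = ?",
    ("GENE2",),
).fetchall()
```

Because the views only select columns from the fact table, each record is stored once and every view reflects edits or batch loads immediately.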

Table 1: Current contents of the GAD

Query tools include keyword-search functions that permit field-specific searches, advanced combinatorial queries and pull-down selections of controlled vocabularies (Fig. 1). Batch searches are done against an aggregate table, allowing the user to input a list of genes (≤300) at once. In this way, batch results from high-throughput assays, such as microarrays, proteomics, cDNA sequencing and SAGE (serial analysis of gene expression), can be rapidly queried in the context of human disease associations.
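A batch search of this kind can be sketched as a lookup against a pre-aggregated, gene-keyed table. The function names and the in-memory dictionary below are assumptions for illustration; the figure of 300 comes from the text and is treated here as an upper limit, and the records are placeholders.

```python
from collections import defaultdict

MAX_BATCH = 300  # figure from the text, treated here as an upper limit (assumption)


def build_aggregate(records):
    """Group (gene_symbol, disease) pairs by gene so each lookup is one dict access."""
    aggregate = defaultdict(set)
    for gene, disease in records:
        aggregate[gene].add(disease)
    return aggregate


def batch_search(genes, aggregate, max_batch=MAX_BATCH):
    """Return {gene: sorted diseases} for each distinct queried gene that has records."""
    # Normalize case and whitespace, drop blanks, and de-duplicate in input order.
    unique = list(dict.fromkeys(g.strip().upper() for g in genes if g.strip()))
    if len(unique) > max_batch:
        raise ValueError(f"batch of {len(unique)} genes exceeds limit of {max_batch}")
    return {g: sorted(aggregate[g]) for g in unique if g in aggregate}


# Placeholder records, not real associations.
aggregate = build_aggregate(
    [("GENE1", "disease A"), ("GENE1", "disease B"), ("GENE2", "disease A")]
)
hits = batch_search(["gene1", "GENE3", "GENE1 "], aggregate)
```

Genes with no archived association (`GENE3` in this example) are simply absent from the result, so a gene list from a microarray experiment reduces to the subset with reported disease associations.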

Figure 1: A simple search of positive associations for the disease schizophrenia.

Fields in this view include Official Gene Symbol, Disease Phenotype, Disease Class, Chromosome, Chromosome Band, Genomic DNA Position, P Value, Reference, PubMed ID and Allele.

Of particular interest are phenotypic descriptions captured at multiple levels. A top level 'disease class' is assigned, followed by 'disease' from the original paper. If studies recognize clinical subphenotypes, endophenotypes or intermediate phenotypes, this is noted in 'narrow phenotype'. Moreover, certain alleles have defined molecular characteristics and are noted under 'molecular phenotype'. These molecular and pathway variants may have a closer relationship to a polymorphism than to the end-stage complex phenotype, such as altered transcription due to a promoter polymorphism (IL6) or serum levels of ACE. Using this hierarchical phenotypic assignment makes it easier to consider molecular phenotypes in the context of end-stage disease. In some cases, although independent end-stage diseases may not share overt similarities at a clinical level, the genetic factors that contribute to those diseases may be shared at a molecular level8,9. The development of a hierarchy of phenotypes, from broad to specific, may allow classification of diseases, subphenotypes and molecular parameters of disease and their relationship to complex traits.
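The four-level phenotype assignment could be represented as a small record type, sketched below. The class and function names are assumptions for illustration, and the disease names are placeholders; only the level names and the 'serum ACE levels' example follow the text.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Set


@dataclass(frozen=True)
class Phenotype:
    disease_class: str                         # top-level class
    disease: str                               # as named in the original paper
    narrow_phenotype: Optional[str] = None     # subphenotype or endophenotype
    molecular_phenotype: Optional[str] = None  # e.g. a serum level

    def path(self) -> List[str]:
        """List the assigned levels from broad to specific, skipping unset ones."""
        levels = [
            self.disease_class,
            self.disease,
            self.narrow_phenotype,
            self.molecular_phenotype,
        ]
        return [level for level in levels if level is not None]


def diseases_by_molecular_phenotype(
    phenotypes: Iterable[Phenotype],
) -> Dict[str, Set[str]]:
    """Map each molecular phenotype to the diseases in which it was recorded."""
    grouped: Dict[str, Set[str]] = {}
    for p in phenotypes:
        if p.molecular_phenotype is not None:
            grouped.setdefault(p.molecular_phenotype, set()).add(p.disease)
    return grouped


# Placeholder assignments, not GAD records.
records = [
    Phenotype("class X", "disease A", molecular_phenotype="serum ACE levels"),
    Phenotype("class Y", "disease B", "subphenotype b1", "serum ACE levels"),
    Phenotype("class X", "disease C"),
]
shared = diseases_by_molecular_phenotype(records)
```

Grouping by the most specific level is one way to surface clinically dissimilar diseases that share a molecular phenotype.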

GAD is an archive of published genetic association studies that provides a comprehensive, public, web-based repository of molecular, clinical and study parameters for >5,000 human genetic association studies at this time. This approach will allow the systematic analysis of complex common human genetic disease in the context of modern high-throughput assay systems and current annotated molecular nomenclature.

References

  1. Nucleic Acids Res. 31, 124–127 (2003).

  2. et al. Nucleic Acids Res. 29, 308–311 (2001).

  3. et al. Nucleic Acids Res. 30, 52–55 (2002).

  4. Hum. Genet. 110, 207–208 (2002).

  5. Anonymous. Nat. Genet. 22, 1–2 (1999).

  6. et al. Nat. Genet. 30, 149–150 (2002).

  7. Nat. Genet. 36, 3 (2004).

  8. et al. Nature 427, 636–640 (2004).

  9. Med. Hypotheses 62, 309–317 (2004).

Author information

Affiliations

  1. Gene Expression and Genomics Unit, 333 Cassell Drive, National Institute on Aging, National Institutes of Health, Baltimore, Maryland 21224, USA.

    • Kevin G Becker
    • Tiffani J Bright
  2. Johns Hopkins Asthma and Allergy Center, Johns Hopkins University, 5501 Hopkins Bayview Circle, Baltimore, Maryland 21224, USA.

    • Kathleen C Barnes
  3. Division of Computational Bioscience, Center for Information Technology, National Institutes of Health, Bethesda, Maryland 20892, USA.

    • S Alex Wang

Corresponding author

Correspondence to Kevin G Becker.

About this article

DOI

https://doi.org/10.1038/ng0504-431
