This page has been archived and is no longer updated

 

Human Genomic Epidemology: HuGENet

By: Leslie A. Pray, Ph.D. © 2008 Nature Education 
Citation: Pray, L. (2008) Human genomic epidemology: HuGENet. Nature Education 1(1):62
Email
What should be done to manage the massive amount of data from genomic epidemiology studies? The CDC established HuGENet, the Human Genome Epidemiology Network, to keep track of it all.
Aa Aa Aa

 

Human genomic epidemiology is a rapidly growing field focused on understanding how genomic variation influences human health. Although epidemiology has its origins in eighteenth-century Europe, genetic epidemiology is a decidedly twenty-first century science. Over the past several years, researchers have published more than 32,000 articles in the peer-reviewed scientific literature regarding the genomic epidemiology of nearly 2,000 different diseases (Figure 1; Yu et al., 2008). That's a lot of data. Thus, in 2001, the Human Genome Epidemiology Network (HuGENet), a global collaborative effort housed in the U.S. Centers for Disease Control (CDC), established the HuGE Published Literature database, or HuGE Pub Lit, in an effort to keep track of these data. This database is 100% open access, meaning that researchers anywhere in the world can freely retrieve the information within it. But what comes out of a database is only as good as what goes in. Indeed, in 2008, scientists from the CDC and their collaborators demonstrated that most genomic epidemiology studies published in the peer-reviewed literature (and therefore included in HuGE Pub Lit) are missing critical information that would help readers better judge the evidence they present (Yesupriya et al., 2008).

Genomic Epidemiology: Cataloging the Data

The CDC defines epidemiology as "the study of the distribution and determinants of disease or health in a population; the study of the occurrence and causes of health effects in humans" (Centers for Disease Control and Prevention, 2006). Epidemiologists conduct large population-level studies in an attempt to associate human diseases with potential causal factors. These studies are different from genome-wide association studies, or GWAS, although epidemiological studies can certainly encompass genome-wide analyses.

In comparison to general epidemiologists, genomic epidemiologists are particularly interested in associations between genetic variation and human health and disease. As an example of a genomic epidemiological study, consider BRCA1 and BRCA2, two genes that have been associated with breast cancer. Scientists know that women with mutations in either gene have a 35%-85% chance of developing breast cancer at some point in their lives, with the range of estimates reflecting the fact that each woman's chance of developing breast cancer is determined by a multitude of genetic and environmental factors and not by BRCA alone. Genomic epidemiologists derived these estimates by specifying a population, determining the number of women with breast cancer and the number of women with mutations in either gene in that population, and then conducting the appropriate statistical calculations.

In their research, genomic epidemiologists frequently rely on the HuGE Pub Lit database, which is a searchable online database of all the peer-reviewed, population-based epidemiological studies of human genes published since October 1, 2000, with (at least) an English abstract, and included in the U.S. National Library of Medicine's online database, PubMed. HuGE Pub Lit is updated every Thursday as part of the CDC's online Genomics and Health Weekly Update. For this update, PubMed is automatically queried for relevant titles and abstracts; a HuGENet curator then goes through the articles, selects those appropriate for inclusion in the database, and indexes the selected articles in the database. As of May 28, 2008, HuGE Pub Lit included 34,470 articles related to 1,935 diseases and 3,541 different genes.

There are six tools available for accessing the HuGE Pub Lit database. The tools are collectively known as the HuGE Navigator, and they are as follows:

  • HuGE Literature Finder is a search engine for finding published literature on human genome epidemiology.
  • HuGE Investigator Browser is a search engine for finding researchers or collaborators in particular fields of human genome epidemiology.
  • Gene Prospector is a search engine for finding published evidence on diseases and risk factors associated with particular genes.
  • HuGE Watch is a tool for tracking publication trends.
  • HuGE Risk Translator is a tool for calculating the predictive value of genetic markers (i.e., the "disease risk").
  • HuGEpedia is an online encyclopedia that summarizes gene-disease associations; viewers can search by disease or by gene.

So, for example, if you want to find genomic epidemiological studies on BRCA1 and breast cancer, all you need to do is go to HuGE Literature Finder, type in something like "BRCA1 and breast cancer," and there you have it—you should see a list of over 400 genomic epidemiological studies on BRCA1 and breast cancer.

Criticism of the HuGENet Database

It remains to be seen how effective HuGENet's database and the HuGE Navigator will truly be and how well they will serve as a research resource for genomic epidemiologists. So far, most of the few peer-reviewed articles that refer to the database are descriptive articles written by HuGENet collaborators.

One of the first nondescriptive, critical studies of the database was a May 2008 paper published in the journal BMC Medical Research Methodology (Yesupriya et al., 2008). However, this study doesn't really critique the HuGE Navigator system per se; rather, it is critical of the way that researchers are communicating information in published genomic epidemiological reports and the way that data that are entered into the system. In the 2008 paper, scientists from the CDC, the University of Ioannina (Greece), the University of Wisconsin, and the March of Dimes Birth Defects Foundation demonstrate how most published papers that contain genomic epidemiology results (and are therefore included in the HuGENet database) are missing critical information, thus making it difficult to judge the validity of the evidence.

For their paper, the scientists randomly selected two sets of articles from the HuGE Pub Lit database: 315 articles from 2001 through 2003, and 28 from 2006. The scientists read the articles and recorded a number of items from each article considered key to assessing the validity of the genomic epidemiology findings. These items included whether genotyping results were validated with the use of replicate samples; whether statistical power was reported in the article; whether gene-environment interactions were measured or even discussed; and whether study participants were drawn from the same ethnic population or ancestry (and whether the article even provided clear information on ethnicity or ancestry).

Of these potential problems, the lack of replication and validation is arguably most significant. After all, only when a scientific finding is replicated in multiple studies does it truly become "scientific knowledge." Even then, scientific findings are always open to scrutiny. Moreover, in genomic epidemiology, replication is crucial to ensure that any observed association is not due to chance but to an actual biological correlation. There is always a possibility that the results of a study will be "false positive," meaning that they show a positive association even though no association actually exists. The greater the number of replicates, the more likely the observed association (if there is one) reflects a true biological association. By the same token, it is also important to report the statistical power of a study because this provides readers with some indication of the likelihood of results being "false negative," meaning that no association has been detected even though there really is a true biological association between the gene and disease in question. The higher the power, the less likely a result is "false negative."

Gene-environment interactions are also important considerations in these types of studies, because most human diseases are caused by a multitude of interacting genetic and environmental factors (indeed, that is why epidemiological studies are done, to tease apart these factors). A gene-disease association found in one set of genetic or environmental circumstances might not hold up in different circumstances. A measure of gene-environment interaction is therefore critical, because it gives readers a sense of the extent to which this is true. Ethnicity is another important factor to consider because when scientists find that a gene increases the risk of disease in one ethnic population, this does not necessarily mean that the same gene increases disease risk in a different population.

In the Yesupriya et al. (2008) paper, after tabulating the results for each set of articles, the scientists compared the two time periods to see whether researchers were becoming more conscientious when reporting the results of genome epidemiology association studies. The authors concluded that, while there were some minor improvements over time (e.g., the later studies were more likely to report statistical power and whether genotype results were validated using duplicate samples), most genomic epidemiology association studies still did not provide sufficient information for readers or reviewers to independently evaluate the findings.

One of the more dramatic findings outlined in this paper is that most of the studies analyzed in both time periods were small, with only about 10% having sample sizes greater than 1,000. Small sample size usually translates into very low statistical power, as many gene-phenotype associations require thousands, if not tens of thousands, of individuals to identify and validate. Many of the studies also misclassified information, omitted important information, or failed to make necessary statistical calculations in certain situations.

The results from the 2008 BMC Medical Research Methodology study highlight the need for more consistency and transparency in scientific reporting so that readers can better judge the validity of scientific evidence and actually use it in their own analyses. After all, the HuGENet database was created to make this fast-growing mountain of evidence accessible and useful to all and to advance the science of human genomic epidemiology. HuGENet potentially serves a valuable scientific purpose, and genomic epidemiology remains a fast-growing and important area of scientific research. But if the information extracted from the database is only as good as what was entered, then what are the implications for how the database will (and can) be used? The lesson here is one for all scientists, not just genomic epidemiologists: Communication is key. Although you may have conducted a brilliant research study, of what use is the study if no one can interpret your results?

References and Recommended Reading


Centers for Disease Control and Prevention. Reproductive Health: Glossary (2006)

Yesupriya, A., et al. Reporting of Human Genome Epidemiology (HuGE) association studies: An empirical assessment. BMC Medical Research Methodology 8, 31 (2008)

Yu, W., et al. GAPscreener: An automatic tool for screening human genetic association literature in PubMed using the support vector machine technique. BMC Bioinformatics 9, 205 (2008)

———. HuGE Watch: Tracking trends and patterns of published studies of genetic association and human genome epidemiology in near-real time. European Journal of Human Genetics 16, 1155–1158 (2008)

Email

Article History

Close

Flag Inappropriate

This content is currently under construction.

Connect
Connect Send a message


Scitable by Nature Education Nature Education Home Learn More About Faculty Page Students Page Feedback



Population and Quantitative Genetics

Visual Browse

Close