Introduction

Genetic association studies (GASs), or candidate–gene studies, assess the association between disease status and genetic variants (gene polymorphisms) in a population. They have been particularly popular for investigating complex diseases, and the number of papers on GASs has increased tremendously. This trend is expected to accelerate because of the rapidly increasing availability of mapped single-nucleotide polymorphisms (SNPs) and advances in genotyping technologies (Donahue and Allen 2005). Dealing with all of the accumulated evidence, especially when studies frequently produce controversial or inconclusive results, is a major challenge. Meta-analysis provides a tool to estimate population-wide genetic risk effects for pertinent gene–disease associations, thereby helping to resolve contradictory results and to decrease the uncertainty of the estimated risk. Since 2002, the molecular research of genotype–phenotype associations has entered the genome-wide era, as a consequence of the completion of the Human Genome and HapMap projects and the development of ultra-high-volume genotyping chip technologies. In contrast to GASs, genome-wide association studies (GWASs) are hypothesis-free (i.e., unbiased) and involve a massive scan of the genome with a dense set of SNPs (up to 500,000) in a single experiment, in search of causal variants. Thus, GWASs represent a comprehensive alternative to GASs when there is a lack of evidence regarding the function or the location of the causal genes. However, the vast amount of data produced by GWASs represents a major methodological challenge, both in primary analysis and in meta-analysis (Thomas and Witte 2002; Hirschhorn and Daly 2005; Zintzaras and Lau 2007).

Accumulation of evidence

A search in HuGENet (http://www.cdc.gov/genomics/hugenet/default.htm) conducted for the period January 2000 to December 2006 found 535, 887, 351, 694, and 249 GASs published in hypertension (Zintzaras et al. 2006a, 2006b), schizophrenia (Zintzaras 2006a, 2006b, 2007), Parkinson’s disease (Zintzaras and Hadjigeorgiou 2004, 2005), breast cancer (Zintzaras 2006a, 2006b), and alcoholism (Zintzaras et al. 2006a, 2006b), respectively (see Fig. 1). These five are among the many common complex diseases that have a major impact on public health and whose underlying pathogenetic mechanisms are not clearly understood. In breast cancer and Parkinson’s disease, the number of studies published each year after 2002 is consistently above 100 and 50, respectively, and in schizophrenia, there is an upward trend, with an average growth rate of 75% per year. In hypertension and alcoholism, the average annual number of published studies is 76 and 35, respectively. Therefore, the same or an even larger number of GAS publications on these topics can be expected in the coming years, and the pattern of published GASs is likely to be similar for other topics.

Fig. 1 Cumulative frequency of published genetic association studies in HuGENet for five main complex diseases

A search in PubMed for GWASs (August 2007) has yielded 36 articles describing 42 studies for various multifactorial diseases, such as coronary artery disease, Crohn’s disease, Parkinson’s disease, age-related macular degeneration, and Alzheimer’s disease. Moreover, an increasing number of funded NIH and European initiatives have been undertaken in this direction during the last few years, leading to a growing number of forthcoming and promising studies (Thomas and Witte 2002).

Validity of genotype–phenotype associations

The published literature has demonstrated a plethora of questionable gene–disease associations, based on both the candidate gene and the genome-wide approach, the replication of which has often failed in independent studies (NCI-NHGRI Working Group on Replication in Association Studies 2007). Although replication is essential for establishing the credibility of a genotype–phenotype association, a number of methodological and design issues in the initial studies are hampering research.

Small sample size is a frequent problem and can result in insufficient power to detect minor contributing roles of one or more alleles. The most realistic genetic association between a polymorphic locus and a disease has been claimed to yield an odds ratio of between 1.1 and 1.5. Therefore, to achieve satisfactory power (>80%) to identify a modest genetic effect (odds ratio 1.2) of a polymorphism present in 10% of individuals, a sample size of 10,000 subjects or more would be needed for a GAS. Likewise, for a GWAS testing 500,000 SNP associations in a case–control study at an overall 5% significance level, a Bonferroni correction would require significance at p = 0.05/500,000 = 10⁻⁷ for each SNP. Then, to attain 95% power for OR = 1.2 and a minor allele frequency of 10%, 15,000 case–control pairs are required in a single study, leading to a significant financial burden (Thomas and Witte 2002; Wang et al. 2005). The meta-analysis of multiple studies clearly has a role in offering an analysis with a higher probability of detecting significant results (Munafò and Flint 2004).
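The arithmetic behind these figures can be illustrated with a short Python sketch. It is not the calculation used in the cited papers: it assumes a standard normal-approximation formula for comparing two proportions, equal numbers of cases and controls, and a two-sided test, and the function names are chosen here for illustration only. For the GAS scenario the 10% is read as the carrier frequency in controls (units are individuals); for the GWAS scenario it is read as the allele frequency in controls (units are alleles, two per person).

```python
import math
from statistics import NormalDist


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level that keeps the overall level at alpha."""
    return alpha / n_tests


def case_frequency(control_freq: float, odds_ratio: float) -> float:
    """Variant frequency in cases implied by the control frequency and the OR."""
    odds = odds_ratio * control_freq / (1.0 - control_freq)
    return odds / (1.0 + odds)


def units_per_group(control_freq: float, odds_ratio: float,
                    alpha: float, power: float) -> int:
    """Approximate number of sampling units needed in EACH group.

    Assumption: normal approximation for a two-sided comparison of two
    proportions with equal group sizes (not taken from the source text).
    """
    p0 = control_freq
    p1 = case_frequency(p0, odds_ratio)
    p_bar = (p0 + p1) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_power * math.sqrt(p0 * (1 - p0) + p1 * (1 - p1))) ** 2
    return math.ceil(numerator / (p1 - p0) ** 2)


if __name__ == "__main__":
    # GAS scenario: one test, 80% power, 10% of controls carry the variant.
    gas_per_group = units_per_group(0.10, 1.2, alpha=0.05, power=0.80)
    print("GAS subjects (cases + controls):", 2 * gas_per_group)

    # GWAS scenario: 500,000 tests, 95% power, allele frequency 10%.
    gwas_alpha = bonferroni_threshold(0.05, 500_000)
    gwas_alleles = units_per_group(0.10, 1.2, alpha=gwas_alpha, power=0.95)
    print("Per-SNP threshold:", gwas_alpha)
    print("GWAS case-control pairs:", math.ceil(gwas_alleles / 2))
```

Under these assumptions the sketch gives sample sizes of the same order as those quoted above; a different genetic model or power formula would shift the numbers.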

Population stratification can be a confounding factor in gene–disease associations and arises when differences in the genetic structure of the underlying population are not taken into account (Zintzaras and Sakelaridis 2007). The cases and controls are then not matched for their genetic background, which can lead to biased or spurious results (Cardon and Palmer 2003). Moreover, design quality issues in individual studies, such as the definition of the phenotype, the validity of the genotyping method, and heterogeneity in exposure to environmental challenges, can increase the risk of bias (Zintzaras and Stefanidis 2005; Zintzaras et al. 2007). A meta-analysis provides a robust tool to investigate discrepant results, to decrease the uncertainty of the estimated risk effect, and to explore the heterogeneity between studies (Zintzaras and Ioannidis 2005; Zintzaras and Lau 2007).

The role of meta-analysis

The number of meta-analyses appearing in HuGENet is 9, 36, 10, 19, and 7, for hypertension, schizophrenia, Parkinson’s disease, breast cancer, and alcoholism, respectively. These meta-analyses can provide answers to the question of whether there is evidence of an association between gene polymorphism and disease. Table 1 shows the meta-analyses’ summary risk effects (odds ratios) for investigating associations between various gene polymorphisms and the five diseases of interest. The meta-analyses were based on all of the available studies at the time of performing the meta-analysis. In total, 96 gene polymorphisms were examined, and the number of studies included in the meta-analyses ranged from 2 to 48. Significant associations under any genetic contrast were found for four polymorphisms in hypertension, 26 in schizophrenia, four in Parkinson’s disease, 11 in breast cancer, and six in alcoholism. An association was considered to be significant when the p-value was less than 0.05 or the 95% confidence interval of the odds ratio did not include 1.0. So far (PubMed accessed August 2007), one meta-analysis in the field of GWASs for Parkinson’s disease has been published, synthesizing only three studies (Evangelou et al. 2007), but there is still a need for establishing a proper methodology for the meta-analysis of genome-wide rich data (Zintzaras and Ioannidis 2007).
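The significance criterion described above can be made concrete with a minimal Python sketch. The pooling model is an assumption: the text does not state which model the individual meta-analyses used, so a fixed-effect, inverse-variance pooling of log odds ratios is shown purely for illustration. The function names and the 2×2 counts are hypothetical, and tables with zero cells are not handled.

```python
import math
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)  # critical value for a 95% confidence interval


def log_odds_ratio(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Return (log OR, standard error) for one 2x2 table.

    a, b: cases with / without the variant
    c, d: controls with / without the variant
    Assumes all four cells are non-zero.
    """
    return math.log((a * d) / (b * c)), math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)


def summarize(log_or: float, se: float) -> tuple[float, float, float, float]:
    """Return (OR, CI lower, CI upper, two-sided p-value)."""
    p_value = 2 * (1 - NormalDist().cdf(abs(log_or) / se))
    return (math.exp(log_or),
            math.exp(log_or - Z95 * se),
            math.exp(log_or + Z95 * se),
            p_value)


def pool_fixed_effect(tables: list[tuple[int, int, int, int]]):
    """Inverse-variance pooled OR across studies (assumed model, see lead-in)."""
    estimates = [log_odds_ratio(*table) for table in tables]
    weights = [1 / se ** 2 for _, se in estimates]
    pooled = sum(w * lor for w, (lor, _) in zip(weights, estimates)) / sum(weights)
    return summarize(pooled, math.sqrt(1 / sum(weights)))


def is_significant(ci_low: float, ci_high: float, p_value: float,
                   alpha: float = 0.05) -> bool:
    """Criterion from the text: p < 0.05 or the 95% CI excludes 1.0."""
    return p_value < alpha or ci_low > 1.0 or ci_high < 1.0


if __name__ == "__main__":
    # Hypothetical counts for three studies of the same polymorphism.
    studies = [(30, 170, 20, 180), (45, 255, 33, 267), (28, 172, 21, 179)]
    for table in studies:
        print(table, summarize(*log_odds_ratio(*table)))
    print("pooled", pool_fixed_effect(studies))
```

Pooling narrows the confidence interval relative to any single study, which is how a meta-analysis can reach significance where the individual studies do not.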

Table 1 Results of 81 meta-analyses of genetic association studies shown in HuGENet. The conclusive pooled odds ratio (OR) and the corresponding 95% confidence interval (CI) as provided by individual meta-analyses are shown. When the OR was not applicable, the p-value for testing the association is shown

The methodological issues relating to meta-analyses have been previously described in detail (Munafò and Flint 2004; Zintzaras and Lau 2007), and are beyond the scope of this paper. However, the benefit of obtaining an assessment of the overall risk effects from a meta-analysis is obvious. As evidence is accumulating rapidly, the updating of genetic risk effects can provide information on whether an association is real or whether more evidence is needed in order to draw reliable conclusions on the association (Neale and Sham 2004). However, a meta-analysis requires a large amount of labor and effort, since it involves systematic searches in databases, article retrieval, data extraction, data entry, and data analysis (Hirschhorn et al. 2002).

Obstacles in meta-analysis

The major obstacle in undertaking a meta-analysis of GASs is the structure and diversity of stored information in databases, such as the Genetic Association Database (http://geneticassociationdb.nih.gov/cgi-bin/index.cgi), PubMed (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed), EMBASE (http://www.embase.com/), HuGENet, and OMIM (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM). The information in these databases is neither structured nor standardized appropriately for meta-analysis, and it is not comprehensive. These databases, though important for promoting research, do not claim to be, and are not, a substitute for the meta-analysis of GASs. Thus, gathering and meta-analyzing data related to an SNP association with a disease/disorder is not straightforward. The minimum information that the articles should provide for meta-analysis is the genotype frequencies, the ‘race’ of samples, the study design, the demographic characteristics, and possible effect modifiers or confounders. In addition, the literature is filled with alternative, idiosyncratic, and arbitrary gene and SNP names and symbols, making cross-comparison and meta-analysis difficult [e.g., an SNP can be found under more than one arbitrary name, designating an amino acid substitution, a nucleotide substitution, or something else, and only rarely under the official dbSNP ID rs-number (http://www.ncbi.nlm.nih.gov/projects/SNP/)] (Kitsios and Zintzaras 2007; Zdoukopoulos and Zintzaras 2007). Much of the information that the biological researcher is interested in is available in public reference databases and in the millions of articles of the scientific research literature, mostly accessible via the Internet. It is estimated that about 80% of biological data are in text form, and even the abstracts are written in free text utilizing a complex biological vocabulary, which may vary significantly between areas of research. Thus, despite their wide availability, these data are not generally machine-searchable. Consequently, no significant increase in the efficiency of data integration can be expected from, for example, the automation of PubMed data retrieval (Teufel et al. 2006; Lacroix 2002; Colhoun 2003). Regarding GWASs, no publicly available database currently stores the enormous amount of accumulating data.
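To make the minimum reporting requirements listed above concrete, the following Python sketch shows one possible structured record for a single study. It is an illustration only, not an existing database schema: all class and field names are hypothetical, the SNP identifier and the counts are placeholders, and the only external convention relied upon is that a dbSNP identifier consists of "rs" followed by digits.

```python
import re
from dataclasses import dataclass, field

# dbSNP reference identifiers have the form "rs" followed by digits.
RS_PATTERN = re.compile(r"rs[1-9]\d*")


def is_rs_number(label: str) -> bool:
    """True if the label looks like an official dbSNP rs-number."""
    return RS_PATTERN.fullmatch(label.strip()) is not None


@dataclass
class GenotypeCounts:
    """Genotype counts for one group (hypothetical field names)."""
    wild_homozygous: int
    heterozygous: int
    variant_homozygous: int

    def variant_allele_frequency(self) -> float:
        subjects = (self.wild_homozygous + self.heterozygous
                    + self.variant_homozygous)
        return (self.heterozygous + 2 * self.variant_homozygous) / (2 * subjects)


@dataclass
class StudyRecord:
    """Minimum information one article would need to supply."""
    citation: str
    snp_id: str                      # official rs-number, checked below
    cases: GenotypeCounts
    controls: GenotypeCounts
    ancestry: str                    # the 'race' of the samples
    study_design: str                # e.g. "case-control"
    demographics: dict = field(default_factory=dict)
    effect_modifiers: list = field(default_factory=list)
    reported_names: list = field(default_factory=list)  # free-text aliases

    def __post_init__(self) -> None:
        if not is_rs_number(self.snp_id):
            raise ValueError(f"not a dbSNP rs-number: {self.snp_id!r}")


if __name__ == "__main__":
    record = StudyRecord(
        citation="Hypothetical et al. 2007",
        snp_id="rs123456",           # placeholder identifier
        cases=GenotypeCounts(50, 40, 10),
        controls=GenotypeCounts(60, 35, 5),
        ancestry="not reported",
        study_design="case-control",
        reported_names=["C677T"],    # illustrative substitution-style alias
    )
    print(record.cases.variant_allele_frequency())
```

Rejecting records that lack an rs-number at entry time is one way to keep free-text aliases from fragmenting the evidence for a single polymorphism; the aliases can still be stored alongside for searching.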

Improving the quality of reporting genotype–phenotype association

The situation of GASs is similar to the problem that clinical medicine faced more than 10 years ago, when the large number of randomized controlled trials of varying quality befuddled clinicians (Lau et al. 1992). Concerted efforts were then initiated by the medical community to improve the quality of reporting in research publications. For example, the CONSORT statement (Begg et al. 1996) was published with a view to improving the quality of the reporting of randomized controlled trials, the QUOROM statement (Moher et al. 1999) focused on the reporting of meta-analyses, and the STARD initiative (Bossuyt et al. 2003) was developed to improve the reporting of studies of diagnostic accuracy. The Cochrane Collaboration was created in response to the need for collecting, synthesizing, and disseminating the effect of health care in prevention. This is a successful model that should be emulated by other scientific disciplines that have similar needs to synthesize evidence.

The need for data sharing has been highlighted by the Genetic Association Information Network (GAIN) initiative, in an effort to facilitate the subsequent and joint analysis of GWAS data (Thomas and Witte 2002). All genotypes will be made public as they are generated and checked for quality. However, the sheer scale of GWAS data will pose significant practical challenges regarding the form of the stored statistical results and their suitability for meta-analysis.

Conclusion

Given the rapid accumulation of evidence regarding gene–disease associations, a Web-based system for data storage and automated meta-analysis of GAS and GWAS results is essential to: (1) keep track of the evidence for gene–disease associations, (2) improve the quality of reporting, and (3) reduce the effort and labor of performing meta-analyses.