Introduction

Candidate gene association studies have been widely used to study genetic susceptibility to complex diseases, including cancer.1 Critics of candidate gene studies have pointed to non-replication of results, false positives, insufficient sample sizes, and limited prior knowledge of biologically relevant candidate genes.2 These concerns have prompted the use of systematic reviews, especially meta-analyses of multiple studies, to minimize false-positive associations and assess the credibility of findings.3 In recent years, genome-wide association studies (GWAS) have greatly accelerated the pace of discovery and found many novel genetic associations that were not anticipated by the candidate gene approach.4, 5 Associations discovered by GWAS raise additional questions, particularly because observed effects are typically very small.6 Furthermore, the implicated SNPs represent markers that require further investigation to identify causal variants,7 although this may become less of a problem as methods for fine-mapping associations improve.

A critical evaluation of a decade’s worth of association studies is warranted as the next phase of cancer genetics research unfolds. In the present analysis, we used the data available in a 2008 paper published by Dong et al8 and in the Cancer Genome-wide Association and Meta Analyses database (Cancer GAMAdb)9 to complete a systematic review of genetic associations reported in cancer GWAS and in meta-analyses and pooled analyses published over an 11-year period, from 2000 to 2011.

Materials and Methods

Cancer GAMAdb

To help consolidate the vast amount of information from both candidate gene studies and GWAS of cancer, the Centers for Disease Control and Prevention’s (CDC) Office of Public Health Genomics and the National Cancer Institute’s Division of Cancer Control and Population Sciences launched the Cancer GAMAdb in 2010.9 This continuously updated database catalogs published GWAS, meta-analyses, and pooled analyses that have evaluated associations between genetic polymorphisms and cancer risk since January 1, 2000. Cancer GAMAdb builds on a published data set by Dong et al,8 which encompassed meta-analyses and pooled analyses of genetic polymorphisms and cancer risk published until March 15, 2008. Associations in the database published after that date have been identified using the Human Genome Epidemiology (HuGE) Navigator database10 and the National Human Genome Research Institute (NHGRI) GWAS catalog.11 The CDC’s HuGE Navigator is a continuously updated knowledge base in HuGE.10 The NHGRI GWAS catalog extracts data from GWAS publications.11 Genetic associations with cancer are selected from these two databases for curation in the Cancer GAMAdb. Data describing each association, including study population, minor allele frequencies, and effect sizes, are manually extracted from each article and entered into the Cancer GAMAdb. The current analysis is based on the data that were included in Cancer GAMAdb as of February 26, 2011.

Selection criteria

We selected genetic associations for our analysis according to the schema in Figure 1. We excluded meta-analyses and pooled analyses in which the P-value of the odds ratio (OR) was ≥0.05 (if P-values were not reported, we calculated them as described in Dong et al8). We also excluded meta-analyses and pooled analyses that had fewer than 500 total cases or were based on fewer than two studies (or for which either number was unknown). We standardized gene names using the Human Genome Organisation gene symbol and the National Center for Biotechnology Information Entrez Gene GeneID, and standardized variant names using the RefSNP accession ID (rs number) where possible. To fill in missing gene names, we searched by variant name using HuGE Navigator’s Variant Name Mapper and the UCSC Genome Browser, and we collected region information if the variant was intergenic.
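The exclusion rules above can be illustrated with a minimal Python sketch. The record type, field names, and function name are hypothetical and are not taken from Cancer GAMAdb. The sketch assumes that any missing P-value has already been calculated upstream; dropping a record that still lacks one is a choice made here for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetaAnalysisRecord:
    """Hypothetical record for one meta-analysis or pooled-analysis result."""
    variant: str
    p_value: Optional[float]    # P-value of the odds ratio
    total_cases: Optional[int]  # None when the number of cases is unknown
    n_studies: Optional[int]    # None when the number of studies is unknown


def passes_selection(rec, alpha=0.05, min_cases=500, min_studies=2):
    """Return True if the record survives the stated exclusion criteria."""
    # Exclude P >= 0.05 (assumption: a still-missing P-value is also excluded).
    if rec.p_value is None or rec.p_value >= alpha:
        return False
    # Exclude when either the case count or the study count is unknown.
    if rec.total_cases is None or rec.n_studies is None:
        return False
    # Exclude fewer than 500 total cases or fewer than two studies.
    return rec.total_cases >= min_cases and rec.n_studies >= min_studies
```

The remaining exclusions shown in Figure 1 (outcome type, specific markers) are not modeled in this sketch.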

Figure 1

Methodology used for inclusion of associations into the analyses.

Our analysis was limited to genetic associations with incident cancer of specified type; we excluded associations with other outcomes (eg, all cancers, precursor lesions, biomarkers, or survival). Associations with circulating levels of IGF1; human leukocyte antigen (HLA) markers; high-penetrance genetic markers (eg, APC, BRCA1); and associations with HRAS1 (which have been questioned because of flawed genotyping methods) were also excluded.

When an association had been examined in multiple meta-analyses, we included only the most recent publication. We gave priority to the most recently reported overall association with a particular variant; however, if no significant overall association had been reported, we included the most recent subgroup-specific association. If a publication reported multiple significant contrasts (ie, results based on different genetic models) for the same variant, we included the contrast with the smallest P-value. When significant associations were found with the same variant in both meta-analyses and GWAS, we checked to be sure that they compared the same contrasts. Associations with combinations of two or more variants were considered unique, even if associations with the individual variants were also reported.
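One way to read the precedence rules above is as a ranking: overall associations outrank subgroup-specific ones, more recent publications outrank older ones, and within a publication the contrast with the smallest P-value wins. The Python sketch below encodes that reading. All names are hypothetical, the input is assumed to contain only statistically significant contrasts, and the publication date stands in for the identity of the publication.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Contrast:
    """Hypothetical record for one significant contrast in a meta-analysis."""
    variant: str
    cancer: str
    published: date
    is_overall: bool   # True for an overall association, False for a subgroup
    p_value: float
    model: str         # genetic model label, e.g. "dominant"


def select_representative(contrasts):
    """Keep one contrast per (variant, cancer) pair.

    Ranking, highest first: overall before subgroup, then most recent
    publication date, then smallest P-value.
    """
    best = {}
    for c in contrasts:
        key = (c.variant, c.cancer)
        rank = (c.is_overall, c.published, -c.p_value)
        if key not in best or rank > best[key][0]:
            best[key] = (rank, c)
    return {key: chosen for key, (rank, chosen) in best.items()}
```

Under this ranking, an older overall association is preferred to a newer subgroup-specific one, which matches the stated priority; a subgroup result is kept only when no overall result is present for that variant and cancer.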

GWAS data were restricted to studies of incident cancer published before February 26, 2011. Studies were identified from the HuGE Navigator and checked against the NHGRI GWAS catalog to ensure completeness. In some instances, we checked the original publication for additional information; if we noticed data discrepancies between the GWAS catalog and the original paper, we used the data from the original paper. GWAS that included meta-analyses were classified as GWAS. Associations for which variants were not specified were excluded from analysis. When multiple GWAS reported the same association, we included only the most recently published study in our analysis.

Analysis strategy

Our analysis considered the extent to which associations reported in meta-analyses and GWAS overlapped. When both types of studies reported associations with the same variant, we called the overlap direct. When they reported associations with variants separated by less than 1 million base pairs, we called the overlap indirect. In an additional analysis, we also examined noteworthy associations, which we defined as those with false-positive report probabilities (FPRP) ≤0.2, a stringent threshold suggested by Wacholder et al,12 and used in the analysis by Dong et al.8 We calculated FPRPs at two levels of prior probability and at two levels of association (OR 1.5 and OR 1.2). As in the analysis by Dong et al,8 we chose to evaluate the associations using a low-prior probability of 0.001 (expected for a candidate gene) and a very low-prior probability of 0.000001 (expected for a random SNP). An association was considered noteworthy if it passed the FPRP threshold in one or more of these four categories.
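The direct and indirect overlap definitions given at the start of this section can be sketched as a small classifier. This is an illustrative Python sketch only: the `Variant` type, its fields, and the function name are hypothetical, and the caller is assumed to compare a meta-analysis variant and a GWAS variant that were reported for the same cancer type and mapped to the same genome build.

```python
from dataclasses import dataclass

MAX_DISTANCE_BP = 1_000_000  # "less than 1 million base pairs"


@dataclass(frozen=True)
class Variant:
    """Hypothetical variant record with a chromosome and base-pair position."""
    rsid: str
    chrom: str
    pos: int


def classify_overlap(meta_variant, gwas_variant, max_distance=MAX_DISTANCE_BP):
    """Return "direct", "indirect", or "none" for a pair of variants."""
    # Direct overlap: both study types report the same variant.
    if meta_variant.rsid == gwas_variant.rsid:
        return "direct"
    # Indirect overlap: different variants on the same chromosome,
    # separated by strictly less than max_distance base pairs.
    same_chrom = meta_variant.chrom == gwas_variant.chrom
    if same_chrom and abs(meta_variant.pos - gwas_variant.pos) < max_distance:
        return "indirect"
    return "none"
```

Because the rule relies on physical distance alone, two variants in strong linkage disequilibrium but more than 1 Mb apart would be classified as "none" by this sketch.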

Results

Significant associations are summarized in Table 1 by the cancer site and the study type.

Table 1 Number of significant associations (in variants and genes) reported in candidate gene meta-analysis and pooled analysis and GWAS, by cancer site

Meta- and pooled analyses

We identified 5131 gene-variant associations with incident cancer from 386 meta-analyses and pooled analyses published after the review by Dong et al. We excluded 3828 (74.6%) associations because their reported P-values were ≥0.05; 1026 more were excluded for reasons listed in Figure 1. After applying all exclusion criteria, we found 277 significant associations; the review by Dong et al included 98 significant associations. Twenty-six associations (7.4% of the unique total) were found both by Dong et al and in meta-analyses published since that paper. Thus, there were 349 unique variant-cancer associations in all, involving 264 genes (76 with more than one associated variant) and spanning 25 different cancer types.

The largest number of candidate gene associations was found for breast cancer (n=80) followed by prostate cancer (n=53). Significant associations from meta-analyses and pooled analyses of candidate genes are listed in Supplementary Table 1.

Genome-wide association studies

We identified 4994 GWAS associations from 825 citations. We excluded 4645 associations with outcomes other than incident cancer and 80 for other reasons listed in Figure 1. In the end, there were 269 unique associations in 223 different genes with 21 different cancer types. The largest number of GWAS associations was found for prostate cancer (n=56) followed by breast cancer (n=36). Variants from GWAS are listed in Supplementary Table 2.

Combined

The combined results from candidate gene meta-analyses and GWAS included 577 unique associations of 446 different genes or chromosomal regions with 32 cancers. When we considered only direct overlap, we found 41 associations that had been reported in both meta-analyses and GWAS (Supplementary Table 3). The largest number of such associations was with prostate cancer (n=25), followed by breast cancer (n=8).

When we restricted our analysis to noteworthy associations (a calculated FPRP ≤0.2 at either prior probability and either OR) and allowed for direct and indirect overlap (both within and between study types), we found 202 unique associations in all. Of these, 66 were from candidate gene studies and 163 were from GWAS; 27 (13%) were found in both meta-analyses and GWAS (Table 2). We were unable to evaluate 38 GWAS associations for noteworthiness because the original publications did not report ORs and CIs. Allowing for indirect overlap, we found the largest numbers of noteworthy associations in leukemia (n=27), followed by prostate cancer (n=25). Noteworthy associations that were found only in meta-analyses (n=39) are listed in Table 3. All noteworthy associations are included in Supplementary Table 4.

Table 2 Number of noteworthy associations reported in candidate gene meta-analyses and pooled analyses and GWAS, accounting for direct and indirect overlap, by cancer site
Table 3 Noteworthy associations only found in meta- and pooled analyses of candidate gene studies

Meta-analyses and GWAS that examined the same variants (direct overlap) reported very similar ORs (Figure 2). All but three associations had ORs between 1.00 and 1.50. The largest effect sizes were observed for esophageal cancer and ALDH2 rs671 (heterozygous) in both meta-analysis (OR=2.52) and GWAS (OR=3.48).

Figure 2

Odds ratios of variants common to candidate gene meta-analyses or pooled analyses (MA) and GWAS excluding ALDH2 in esophageal cancer (meta-analysis OR=2.52, GWAS OR=3.48).

Discussion

We summarized the principal findings from a decade of published genetic associations with incident cancer. We found that meta-analyses and pooled analyses of candidate gene studies had identified 349 statistically significant associations and GWAS identified 269. Very few associations were found in both groups; however, variant-cancer associations that were reported in both meta-analyses and GWAS had comparable effect sizes.

When we stratified on the basis of cancer type, there was considerable variation in the relative numbers of associations identified by meta-analyses and GWAS. For example, meta-analysis of candidate genes identified 80 breast cancer variants, versus 36 identified by GWAS. In contrast, meta-analysis found only four leukemia variants, compared with 32 identified by GWAS. The difference in the number of significant associations between the meta-analyses of candidate gene studies and GWAS could reflect variations in research interest, prevalence, or underlying knowledge of pathogenesis of different cancers.

Candidate gene studies and GWAS use different thresholds to define statistical significance. We used a P-value threshold of 0.05 for candidate gene studies and 1.0 × 10⁻⁵ for GWAS; the latter is used by the NHGRI GWAS Catalog, although 5 × 10⁻⁸ is more widely accepted in the literature today. These thresholds are consistent with those used in the original studies; however, it has been suggested that P=0.05 may be too lenient for candidate gene studies13 and P=5 × 10⁻⁸ may be too stringent for GWAS.14, 15 If this is true, then differences in the number of significant associations could also reflect an excess of false-positive findings from candidate gene studies13 and an excess of false negatives from GWAS.14 It has been suggested that in GWAS, where false negatives outnumber false positives, relaxing the significance threshold to 10⁻⁷ would yield mostly genuine discoveries.15 Others have suggested that 10⁻⁷ be used as the criterion for early commercial genotyping arrays, with the standard 5 × 10⁻⁸ retained for current or merged commercial arrays.16 False-positive findings due to overly lenient thresholds may be particularly pertinent for candidate gene studies that examine several variants and do not correct for multiple testing.

We identified noteworthy associations by calculating FPRPs as described by Wacholder et al.12 The FPRP for a genetic association takes into account not only the observed P-value but also the prior probability of the association and the statistical power of the test. We found 189 noteworthy associations in addition to the 13 previously reported by Dong et al.8 Most of these noteworthy associations were identified in GWAS; however, 39 were found exclusively in meta-analyses of candidate gene associations.

Meta-analysis of candidate gene association studies diminishes, but does not entirely exclude, random error and bias as causes of false-positive associations. GWAS also have challenges; in particular, the actual fraction of the genome interrogated in a GWAS varies with the genotyping platform and study population.17 Although imputation methods may help increase the genomic coverage, they are not perfect, especially for variants of lower frequency. For example, in an attempt to unify candidate gene and GWAS approaches in asthma, Michel et al18 found that GWAS coverage was insufficient for many asthma candidate genes.

More in-depth analysis in future studies could further elucidate why 39 noteworthy candidate gene associations did not reproduce in GWAS. Insufficient power due to the limited ability of GWAS to detect rare variants may play a role. Candidate gene studies are not well suited to the study of exceptionally rare variants either, at least not without very large sample sizes. Uncommon variants, however, such as those in CHEK2, may still have frequencies too low to be detected through GWAS.4 The CHEK2 1100delC mutation, an established genetic risk factor for breast cancer, was found in 0.7% of cases and 0.4% of controls in a Swedish study population.19 Despite the many GWAS conducted in breast cancer, CHEK2 has not passed the 1 × 10⁻⁵ threshold (as reported by the NHGRI GWAS Catalog).11 It is important to add, however, that the mutation was discovered not by candidate gene methods but by studying families with Li–Fraumeni syndrome.20 As in candidate gene studies, inadequate sample size should also be considered as a possible source of insufficient power. Significant positive correlations have been noted between the number of novel SNPs detected and the sample size of GWAS.21

In our study, the 41 associations common to both meta-analysis and GWAS had effect sizes that were generally similar and mostly small. A notable outlier is the association of ALDH2 rs671 with risk of esophageal cancer, which has been described by three meta-analyses and one GWAS since 2000. ALDH2 encodes a key enzyme in the metabolism of consumed alcohol, which is a major epidemiologic risk factor for esophageal cancer. A 2009 paper by Khoury and Wacholder notes that very few association studies have considered gene–environment interactions. Incorporating both genetic and environmental factors in the analysis may be one path to finding additional associations and larger effect sizes, but it may require extremely large sample sizes to achieve sufficient power.22 Other methodological challenges unique to genome-wide environmental interaction studies exist, which may help explain the low number of publications in this field.23

Our analysis had some limitations. By considering only meta-analyses of candidate genetic associations, we could have left out some recent individual candidate gene studies with sufficiently large sample sizes to find noteworthy associations. By considering only the main associations in candidate gene meta-analyses, we could have overlooked important subgroup associations, such as some that seem to be race- or ethnicity-specific associations.24 We also did not use linkage disequilibrium between markers when defining indirect overlap but relied on physical distance. It is known that linkage disequilibrium and physical distance are correlated.25 Markers that are located close to each other generally exhibit higher linkage disequilibrium than those that are located further apart. Although a distance of 1 Mb may be considered large for identifying overlapping associations, at least one GWAS has traced an association to a causal variant located at roughly this distance away from the original signal.26 It should also be noted that reducing this distance would not change our conclusion that there is limited overlap between the two study types. Finally, we attempted to avoid duplication of cases by limiting our analysis to only the most recent meta-analysis and GWAS for each association. Nevertheless, this possibility cannot be completely excluded, especially because GWAS are often assembled from previously ascertained groups of cases and controls.

One criticism of candidate gene studies is that most genetic associations are not replicated in subsequent studies.2 As with GWAS, findings from candidate gene studies must be replicated to be considered valid. Meta-analysis of the published literature is an important tool in assessing the cumulative evidence on genetic associations.27 Consortia offer another approach to meta-analysis that may help protect against the effects of selective reporting and publication bias. In a study comparing meta-analyses of individual case-control studies with consortium analyses in breast cancer, the authors concluded that meta-analyses and consortia-wide analyses were complementary.28 Consortium-based analyses may be particularly useful for detecting variants modified by weak-to-moderate gene–environment interactions.29 Meta-analysis has also become increasingly popular in GWAS,30 where it can aid in exploring the heterogeneity across data sets and identifying more disease-related genes.31 In 2011, there were 173 publications on meta-analyses and pooled analyses of candidate genes in cancer, and 39 GWAS, of which 6 included a meta-analysis.9

In light of improved genetic sequencing technologies, some discussion of the future roles of GWAS and candidate gene studies is appropriate. One limitation of current GWAS technology is its poor ability to detect low-frequency variants. A study by Siu et al32 found that GWAS coverage of rare variants was still inadequate despite using chips designed to detect them. In addition, the quality of imputed low-frequency and, especially, rare variants in these studies is generally lower than that for common variants.33 Still, arrays and reference panels have improved considerably since the advent of GWAS, and the most recent of these were not represented in our analysis. It has been estimated that previous GWAS have detected less than 20% of all independent GWAS-detectable SNPs in chronic diseases, but future GWAS can potentially detect more SNPs through improved coverage and, especially, larger sample sizes.21

Studies that use recently developed arrays such as MetaboChip,34 ImmunoChip,35 and iCOGS array36 represent the latest reinvention of the candidate gene study. These chips can contain hundreds of thousands of SNPs that were chosen for replicating and fine-mapping loci identified from GWAS, as well as to cover the most promising candidate genes. A recent consortium-based meta-analysis that used the iCOGS array identified 23 new prostate cancer susceptibility loci.37 Next-generation sequencing is also increasingly helping to improve the understanding of genetic association studies.32 Projects such as ENCODE are likely to provide new insights into GWAS associations in non-coding regions of the genome.38 Together, these multiple approaches will help us identify additional genetic associations and understand their functional implications.