Introduction

Genome-wide association studies using large samples of cases and controls have been successful in identifying many genes involved in ‘common’ diseases. For breast cancer, at least 20 genes or genomic regions have been found to be associated with the disease using these methods.1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 Rarer forms of variation, such as disease variants in the BRCA1 and BRCA2 genes and moderate penetrance genes found by targeted candidate studies, bring the total number of genes implicated in breast cancer to about 30. However, collectively these account for only ∼30% of the disease genetic variance, suggesting that many more genes remain to be discovered. There are several possible sources for the ‘missing heritability’, including genes missed because of low statistical power and uneven marker coverage, heterogeneity through sub-phenotypes associated with different genetic variants, and rare genetic variation (as only common variants are screened by existing single nucleotide polymorphism (SNP) panels). One route to identifying new targets is through meta-analysis, which, by combining data from independent samples, can increase the power to detect novel disease variation for further follow-up. Zeggini et al.13 describe meta-analysis of genome-wide association studies combining evidence for individual SNPs using genotyped and imputed data. We develop and apply a composite likelihood-based method, which models information from multiple SNPs in a given genomic region and combines evidence across corresponding regions in independent data sets. Modelling association with disease on an underlying linkage disequilibrium (LD) map further increases power and resolution of mapping, compared with single SNP tests.14, 15 This approach also partitions the genome into regions that contain equivalent levels of LD. As only one statistical test is made in each region, there is no increased multiple-testing penalty with greater marker density.
The three data samples analysed here are from the Cancer Genetic Markers of Susceptibility Project (CGEMS),1 which comprises post-menopausal breast cancer samples, a sample from the prospective study of outcomes in sporadic versus hereditary breast cancer (POSH) cohort of early-onset breast cancer,4, 16 and data from the Wellcome Trust Case Control Consortium (WTCCC).17 The first two data sets have genome-wide SNPs whereas the latter comprises genotypes for non-synonymous coding SNPs only. The three samples therefore show a high degree of heterogeneity in both phenotype and marker coverage and thus present a challenge for meta-analysis. Previous findings from these data sets include primary evidence for association of the FGFR2 gene with breast cancer,1 whereas the POSH sample formed part of a larger study that determined novel breast cancer genes.4 Our analysis examines the evidence for breast cancer genes through composite likelihood-based meta-analysis in these three samples.

Materials and methods

Data preparation and quality control

We undertook the following procedures for the three data samples analysed in the meta-analysis:

CGEMS sample

The CGEMS sample (http://cgems.cancer.gov/) comprises data from 1145 post-menopausal breast cancer patients and 1142 controls genotyped with 555 148 SNPs. The downloaded sample excludes data that failed the original quality control (QC) employed by Hunter et al.1 Using PLINK18 we removed an additional 93 SNPs with inconsistent or ambiguous genomic locations, 8648 SNPs and one individual with >10% of genotypes missing, 53 615 SNPs with minor allele frequencies (MAF) <0.05 and 4308 SNPs with large deviations from Hardy–Weinberg (HW) ($\chi^2_1 > 10$) in the controls. Some SNPs failed QC for more than one of these reasons. After merging with the HapMap (phase 3) reference populations and undertaking multidimensional scaling cluster analysis, we identified and removed an additional four samples, which did not cluster strongly with the CEU (Western European origin) reference population, suggesting potential admixture. After QC we had a total of 498 786 SNPs typed in 1143 cases and 1139 controls.
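The marker-level filters described above can be illustrated with a minimal Python sketch. This is not the PLINK implementation: the function names and the genotype-count representation are hypothetical, and only the thresholds (10% missingness, MAF 0.05 and HW $\chi^2_1 > 10$ in the controls) are taken from the text.

```python
def hardy_weinberg_chi2(n_aa, n_ab, n_bb):
    """1-df Pearson chi-square for deviation from Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 0.0
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    # Skip genotype classes with zero expectation (monomorphic markers).
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


def passes_qc(control_counts, all_counts, n_missing,
              max_missing=0.10, min_maf=0.05, max_hw_chi2=10.0):
    """Return True if a SNP survives the three marker filters.

    control_counts, all_counts: (n_aa, n_ab, n_bb) genotype counts in the
    controls and in the whole sample (hypothetical input layout).
    n_missing: number of individuals without a genotype call.
    """
    n_called = sum(all_counts)
    n_total = n_called + n_missing
    if n_called == 0:
        return False
    if n_missing / n_total > max_missing:  # missingness filter
        return False
    n_aa, n_ab, _ = all_counts
    freq_a = (2 * n_aa + n_ab) / (2 * n_called)
    if min(freq_a, 1.0 - freq_a) < min_maf:  # minor allele frequency filter
        return False
    # Hardy-Weinberg filter is applied to the controls only.
    return hardy_weinberg_chi2(*control_counts) <= max_hw_chi2
```

In this sketch a SNP can fail on several criteria at once, which is why the per-criterion exclusion counts reported above need not sum to the total number of SNPs removed.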

We increased genotypic coverage by 98% through imputation of missing SNP genotypes, after merging CGEMS data with CEU data, and a total of 544 683 markers were imputed ‘sufficiently’ according to PLINK defaults. The default cut-offs designate reliably imputed SNPs and require there to be ‘information content metric’ values >0.8 and SNP genotype imputation for 90% or more individuals in the sample. Before combining with CGEMS, further QC removed 55 692 SNPs (488 991 imputed SNPs retained), of which 308 had >10% missing genotypes, 6800 showed significant HW deviation in the controls and 49 019 had MAF <0.05. Some SNPs failed QC for more than one of these reasons. The combined data set comprised genotypes for 987 777 SNPs (Supplementary Table 1).

POSH sample

Turnbull et al.4 describe an analysis that includes 308 POSH cases genotyped on an Illumina Infinium (Illumina, Inc, San Diego, CA, USA) 660k array, forming part of a study of cases preferentially selected to have at least two affected first or second degree relatives. Most cases had been screened and found to be negative for germline mutations in the BRCA1 and BRCA2 genes. Data for 294 POSH cases and 580 030 SNPs were provided by the lead authors. Their QC procedures4 resulted in the exclusion of 63 112 markers. In addition, a total of 14 individual samples were excluded because of apparent non-European ancestry (eight samples) and heterozygosity P-values <$10^{-5}$ (six samples). We used WTCCC phase 2 genotypic data from the European Genotype Archive (EGA) (http://www.ebi.ac.uk/ega/page.php?page=study&study=EGAS00000000028&cat=www.wtccc2.studies.xml&subcat=controls) as controls. Both the 1958 birth cohort and the UK National Blood Service (UNBS) control Illumina 1.2M genotypic data sets were used in the analysis. During QC on the controls, using the exclusion lists supplied by the WTCCC, we removed 231 samples and 215 732 SNPs from the 1958 birth cohort and 236 samples and 214 848 SNPs from the UNBS data. From the original 1 155 595 markers genotyped across 2930 1958 birth cohort controls and 2737 UNBS controls, first-stage QC yielded 939 863 SNPs in 2699 controls and 940 747 SNPs in 2501 controls in the two samples, respectively.

We determined a common subset of 536 205 SNPs typed in both WTCCC controls and POSH cases genotypic data sets. Our standard QC in the combined data set identified an additional 3926 SNPs deviating from HW equilibrium ($\chi^2_1 > 10$) in the controls, 106 SNPs with >10% missing genotypes and 23 738 SNPs with MAF<0.05. The final data set comprised genotypic information for 280 cases, 5200 controls and 506 610 SNPs after removal of 2138 SNPs contained in the exclusion lists provided by the lead authors.

Using PLINK we then determined 535 110 sufficiently imputed SNPs, of which 17 387 were excluded because of significant HW deviation, 270 because of >10% missing genotypes and 46 410 because of MAF<0.05 (some SNPs failed on more than one criterion). The final imputation-inclusive data set comprised 979 409 SNPs for 5200 WTCCC phase 2 controls and 280 POSH cases, an increase in genotypic coverage by imputation of ∼93%.

WTCCC sample

WTCCC phase 1 breast cancer data were obtained from the European Genotype Archive (http://www.ebi.ac.uk/ega/page.php?page=study&study=EGAS00000000024&cat=www.wtccc.studies.xml.ega2&subcat=BC). The aggregated genome-wide data set of genotypes for 15 436 SNPs across 1045 cases and 1476 controls was subjected to QC following the annotation files provided, which resulted in 2859 SNPs and 51 samples (41 cases and 10 controls) being removed. Marker exclusion was based on poor genotype call scores, high missing genotype rate, monomorphic SNPs and HW deviations. Sample exclusion was due to putative relatedness of individuals, questionable ancestry, missing genotypes and positive BRCA2 testing. The final data set comprised 12 577 SNPs, 1004 cases and 1466 controls.

Genome-wide imputation in QC-clean WTCCC data, after merging with the CEU data (111 individuals and 1 615 203 SNPs), identified 36 587 sufficiently imputed SNPs. Our standard QC in the combined data set identified 28 SNPs with >10% missing genotypes, 1829 SNPs with significant HW deviation and 7052 SNPs with MAF<0.05. The final imputation-inclusive data set comprised 40 300 SNPs, 1004 cases and 1466 controls.

Composite likelihood mapping

We used the CHROMSCAN program,19 which models association between disease and SNP markers in a chromosome region to compute a maximum composite likelihood location, S, for a disease variant, a standard error for that location, a 95% confidence interval and a P-value. The program incorporates the underlying LD structure in the region as an LD unit (LDU) map,20 which represents regions of strong LD (blocks) as plateaus and recombination hot-spots as steps when plotted against the kilobase map. Gene mapping on the LDU map has been shown to increase power and accuracy.21 The LDU maps were made from the CEU sample (HapMap phase II and build 36 of the human genome sequence). CHROMSCAN establishes significance for a region through a permutation test, which employs a large number of replicates for which the disease phenotype is randomised by shuffling. Computing probabilities for the test statistic based on the null P-value distribution avoids distortions (inflation and deflation) in the P-value distribution. The program generates an information matrix from which information weights, W, for location S are obtained along with a standard error. CHROMSCAN analysis was performed in fixed regions of four LD units, which facilitates meta-analysis and provides reasonable coverage (on average) of each region (>30 SNPs in a ∼550 000 SNP scan, given that there are ∼60 000 LDUs in the CEU genome22, 23).
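The phenotype-shuffling permutation test described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not CHROMSCAN's procedure: the statistic used here (a difference in mean allele count between cases and controls) is a hypothetical stand-in for the composite likelihood test statistic, and the add-one correction in the P-value is an assumption of this sketch.

```python
import random


def permutation_p_value(statistic, phenotypes, genotypes, n_perm=1000, seed=0):
    """Empirical P-value for `statistic` under shuffled case/control labels."""
    rng = random.Random(seed)
    observed = statistic(phenotypes, genotypes)
    shuffled = list(phenotypes)
    n_as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # randomise the disease phenotype
        if statistic(shuffled, genotypes) >= observed:
            n_as_extreme += 1
    # Add-one correction (assumption) keeps the estimate above zero.
    return (n_as_extreme + 1) / (n_perm + 1)


def mean_allele_count_difference(phenotypes, genotypes):
    """Stand-in statistic: |mean allele count in cases - mean in controls|."""
    cases = [g for y, g in zip(phenotypes, genotypes) if y == 1]
    controls = [g for y, g in zip(phenotypes, genotypes) if y == 0]
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))
```

Because shuffling preserves the numbers of cases and controls, each replicate has the same design as the observed data and differs only in which individuals carry the case label.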

Meta-analysis

For combining information across the CGEMS, POSH and WTCCC samples, we examined Fisher's combined probability test (CPT),24 the Z-transform test (ZTT),25 and the weighted Z-test (WZT).26 Whitlock's study26 shows that the WZT has greater power and precision than the CPT and ZTT in simulated data. We used the CPT to combine permutation-based P-values from the three samples (or fewer if a sample contained no information for a given region) in corresponding four LDU regions as: $\chi^2_F = -2 \sum_{i=1}^{k} \ln P_i$, where $P_i$ is the P-value from the $i$th sample, $k$ is the number of samples and $\chi^2_F$ has a $\chi^2$ distribution with $2k$ degrees of freedom. $\chi^2_F$ values were converted to the corresponding probability and $\chi^2_1$ using the appropriate functions from the gsl library (http://www.gnu.org/software/gsl/manual/html_node/The-Chi_002dsquared-Distribution.html). The ZTT first converts P-values to the corresponding (signed) standard normal deviates $Z_i$. Z-scores were obtained from the permutation-based probabilities from each sample using the gsl library as above. The combined Z-score ($Z_S$) is obtained as: $Z_S = \sum_{i=1}^{k} Z_i / \sqrt{k}$, where $Z_S$ has a standard normal distribution. We computed the corresponding combined P-values and the corresponding $\chi^2_1$ using the gsl library. The WZT is a weighted version of the ZTT. We weighted each sample by information $W_i$, from the composite likelihood model as above, and obtained $Z_W = \sum_{i=1}^{k} W_i Z_i / \sqrt{\sum_{i=1}^{k} W_i^2}$.
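The three combination rules described above can be sketched with the Python standard library alone. The study itself used the gsl library, so this is an illustrative re-expression rather than the authors' code. All function names are hypothetical, the one-sided conversion $Z_i = \Phi^{-1}(1 - P_i)$ follows the usual convention for these tests and is an assumption here, and the weighted form assumes Whitlock's formula with the information weights used directly.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square with even df (closed form)."""
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total


def chi2_1df_from_p(p):
    """Chi-square (1 df) value whose upper-tail probability is p."""
    return _STD_NORMAL.inv_cdf(p / 2.0) ** 2


def z_from_p(p):
    """One-sided standard normal deviate; equals inv_cdf(1 - p)."""
    return -_STD_NORMAL.inv_cdf(p)


def fisher_cpt(p_values):
    """Fisher's combined probability test: -2 * sum(ln P) on 2k df."""
    stat = -2.0 * sum(math.log(p) for p in p_values)
    return chi2_sf_even_df(stat, 2 * len(p_values))


def ztt(p_values):
    """Z-transform test: sum of Z-scores divided by sqrt(k)."""
    z_s = sum(z_from_p(p) for p in p_values) / math.sqrt(len(p_values))
    return _STD_NORMAL.cdf(-z_s)


def wzt(p_values, weights):
    """Weighted Z-test: sum(w * Z) / sqrt(sum(w^2))."""
    num = sum(w * z_from_p(p) for p, w in zip(p_values, weights))
    den = math.sqrt(sum(w * w for w in weights))
    return _STD_NORMAL.cdf(-num / den)
```

With equal weights the weighted test reduces to the ZTT, and passing only the samples that have data for a region handles the case where fewer than three samples contribute.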

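We evaluated statistical heterogeneity from pooling evidence across samples following Tapper et al.,27 who compute $\chi^2_{k-1} = \sum_{i=1}^{k} W_i (S_i - \bar{S})^2$, the heterogeneity $\chi^2$ with $k-1$ degrees of freedom, where $\bar{S} = \sum_{i=1}^{k} W_i S_i / \sum_{i=1}^{k} W_i$, $W_i$ is the corresponding information weight and $S_i$ is the maximum composite likelihood LDU location from the $i$th sample with data in the LDU region under consideration.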
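As a rough illustration of the heterogeneity test, the sketch below assumes the statistic takes the usual information-weighted form, in which each location's squared deviation from the weighted mean location is multiplied by its information weight. That form, and the function name, are assumptions of this example rather than a transcription of the method of Tapper et al.

```python
def heterogeneity_chi2(locations, weights):
    """Return (chi2, df, weighted_mean) for k location estimates.

    locations: maximum composite likelihood LDU locations S_i, one per sample.
    weights:   information weights W_i for those locations.
    Assumed form: chi2 = sum(W_i * (S_i - S_bar)^2), with k - 1 df.
    """
    if len(locations) != len(weights) or len(locations) < 2:
        raise ValueError("need matching locations and weights for k >= 2 samples")
    total_weight = sum(weights)
    s_bar = sum(w * s for s, w in zip(locations, weights)) / total_weight
    chi2 = sum(w * (s - s_bar) ** 2 for s, w in zip(locations, weights))
    return chi2, len(locations) - 1, s_bar
```

Under this form, samples that place the variant at the same LDU location contribute nothing to the statistic, while a well-estimated location (large weight) far from the weighted mean contributes the most.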

Results

The meta-analysis combines association tests in four-LDU regions across CGEMS (14 340 regions with data), POSH (13 908 regions with data) and WTCCC (3679 regions with data) samples. Although the WTCCC sample comprises far fewer SNPs than the other samples (Supplementary Table 1) and only covers non-synonymous SNPs, it provides additional information that can be exploited in composite likelihood-based meta-analysis. Turnbull et al.4 in their Table 1 detail 13 loci confirmed as associated with breast cancer through association studies (both genome-wide and candidate gene based). We examined evidence for each of these regions in our meta-analysis (Table 1). The FGFR2 locus is known to contribute one of the largest effect sizes among the common susceptibility loci detected so far. The gene was identified in the CGEMS sample by Hunter et al.1 who recognised it as a risk factor for sporadic post-menopausal breast cancer. It is noteworthy that the evidence is supported by the small number (280) of POSH individuals, indicating it may have roles in both early-onset and post-menopausal breast cancer. The 95% confidence interval for the location of the associated variant for CGEMS and POSH samples spans 26 kb, which corresponds to intron 2 of the gene (Table 2). Intron 2 is not tagged by the non-synonymous SNP panel from the WTCCC genotypes so this sample provides no additional evidence. The combined meta-analysis $\chi^2_1$ of 25.75 (ZTT) makes this the highest ranked region and the only region achieving genome-wide significance, P=0.006, after a conservative Bonferroni correction for 14 340 tested regions.
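The Bonferroni-corrected significance quoted for the FGFR2 region can be checked with a short calculation. The sketch below uses only the Python standard library; the function names are hypothetical, and the inputs (a 1-df chi-square of 25.75 and 14 340 tested regions) are the values given in the text.

```python
import math


def chi2_1df_p_value(chi2):
    """Upper-tail probability of a chi-square statistic with 1 df."""
    return math.erfc(math.sqrt(chi2 / 2.0))


def bonferroni(p, n_tests):
    """Bonferroni-corrected P-value, capped at 1."""
    return min(1.0, p * n_tests)


p_nominal = chi2_1df_p_value(25.75)
p_corrected = bonferroni(p_nominal, 14340)
print(f"nominal P = {p_nominal:.2e}, corrected P = {p_corrected:.4f}")
```

The corrected value rounds to 0.006, in agreement with the figure reported for this region.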

Table 1 Results for known breast cancer genes and regions
Table 2 The FGFR2 and 8q24 regions: association with post-menopausal and early-onset breast cancer

The evidence for other breast cancer genes (Table 1) is variable and reflects the relatively low power of fairly small, heterogeneous and incompletely genotyped samples for detecting low risk variants. However, it is noteworthy that the 8q24 breast cancer region has the fifth highest rank (ZTT $\chi^2_1$ 14.35) using combined evidence from CGEMS and POSH samples (but not WTCCC as this is a ‘gene desert’28). There is evidence, therefore, that this region may be involved in both early-onset and post-menopausal breast cancer. The 8q24 region has well established associations with prostate, breast and colorectal cancer.8 It has been shown that the effects of a number of risk alleles in the region are cancer site specific. Easton et al.2 first reported an association between rs13281615 (128.42 Mb) (Figure 1) and breast cancer and subsequently Fletcher et al.8 reported a protective effect at rs13254738 (128.17 Mb) with limited evidence for interaction between the two. The LDU map of the region (Figure 1) and the likelihood surface for the POSH data (Figure 2) show that the 95% confidence interval (128.38–128.44 Mb) (Table 2) includes rs13281615. However, rs13254738 lies outside the four LDU region, and the region containing it was not identified as high ranking in the meta-analysis. As pointed out by Fletcher et al.8 the relatively low power of studies undertaken so far suggests that 8q24 may contain several additional breast cancer susceptibility loci. Composite likelihood evidence from the CGEMS sample (Table 1) places the peak at 128.497 Mb (Figure 1) within the prostate cancer ‘region 3’,8 although the most significant single SNP in the region is rs10447995 at 128.427 Mb, much closer to, and in the same LD block as, rs13281615 and the peak from the POSH sample. Multiple signals reflect the existence of a cluster of, possibly independently acting, associated variants with heterogeneous influences on disease phenotype, which may impact, for example, on differences in age of onset.

Figure 1

Linkage disequilibrium (LD) map of the 8q24 region. The LD map locations of the most significant (ms) SNPs are shown along with the maximum composite likelihood locations (S) for CGEMS and the prospective study of outcomes in sporadic versus hereditary breast cancer (POSH), and the 95% confidence intervals for the locations. The evidence from the POSH data maps to the same LD block as the rs13281615 variant previously implicated in breast cancer, whereas the CGEMS evidence maps to nearby ‘region 3’, but within the same four LD unit region, implicated in prostate cancer.

Figure 2

Composite likelihood in the prospective study of outcomes in sporadic versus hereditary breast cancer (POSH) 8q24 region. The test statistic X is the difference in (composite) likelihood between the null and disease association models in the 8q24 genomic region. The location of the rs13281615 variant previously implicated in breast cancer is within the 95% confidence interval, suggesting a role for this region in early-onset disease.

The COX11 gene region ranks 107th (ZTT) and the WTCCC study provides contributory information in the meta-analysis. The most significant non-synonymous WTCCC SNP in this region is rs7222197, which is in the STXBP4 gene, adjacent to COX11. This presumably reflects LD across the region, as the SNP considered to be most strongly associated with risk is correlated with elevated levels of cytochrome C assembly protein 11 (COX11) and not with altered expression at STXBP4.29 However, causal relationships between this effect and breast cancer predisposition have not been determined.

Of the 13 known breast cancer genes/regions listed in Table 1, the highest $\chi^2_1$, suggesting a more powerful test, is achieved by the ZTT in six cases, by the WZT in five cases and by the CPT in two cases. Consistent with this pattern, the highest sum of $\chi^2$ is for the ZTT (69.02) whereas the lowest is for the WZT (63.6). Given this empirical support for the ZTT, we list the 10 highest ranked regions on the basis of this test (Table 3). The FGFR2 region is the most significant of these. Also represented is the region containing the PIK3AP1 gene ($\chi^2_1$ 13.06). This gene is known to be associated with a key carcinogenesis pathway and has been found to be upregulated in the peripheral blood of breast cancer patients. Expression of this gene has been used as part of a set of profiles as a molecular predictor of breast cancer.30

Table 3 Highest ranked regions (ZTT) in meta-analysis (CGEMS, POSH and WTCCC)

There is evidence for significant statistical heterogeneity for the COX11 and SLC4A7 breast cancer regions (Table 1) and for two of the top-ranked regions (Table 3). Among a number of likely sources for heterogeneity, evidenced by combining statistics from these samples, are variations in marker coverage and informativeness, differences in breast cancer phenotype, and the impact of multiple (or different) association signals within a region. Increasing sample sizes, more consistent marker coverage, more refined breast cancer phenotypes and fine mapping (including mapping within smaller LDU windows in samples with higher marker density) will reduce the impact of these sources of heterogeneity in future.

Discussion

Composite likelihood-based meta-analysis in discrete regions defined on an underlying LD map has a number of advantages over single SNP based approaches for combining evidence. Model fitting combines evidence across a number of SNPs giving a point estimate along with a confidence interval in the region of interest. Within fixed regions increasing marker density, including that achieved by imputation of genotypes, does not increase the multiple testing penalty. Combination of P-values across regions using the ZTT enables meta-analysis of samples that have heterogeneous phenotypes (such as early and late-onset disease in the POSH and CGEMS samples respectively) and widely differing marker coverage profiles (such as the WTCCC compared with CGEMS and POSH data sets). The empirical evidence from known breast cancer gene regions (Table 1) marginally supports the use of the ZTT rather than the weighted test (WZT). The Fisher test (CPT) clearly lacks power as noted by Whitlock.26 The weighted Z-test favoured by that author is likely to be the most powerful where reliable weights are available. The apparent modest superiority of the ZTT over the WZT in our study may reflect instability in the weights where the completeness of marker coverage and marker information content is particularly variable in these heterogeneous samples. There is also a statistical argument, pointed out by Whitlock,26 that the P-values are already weighted by sample size when using the ZTT.31 The empirical evidence supports the use of the ZTT in composite likelihood-based meta-analysis but examination of the weighted scores may identify further potential candidates for follow up.

This meta-analysis supports existing evidence for at least two known breast cancer genes and regions (FGFR2 and 8q24), despite the relatively small number of samples included and heterogeneous marker coverage in the three data sets. The application of this approach to combination of evidence from larger samples and for defined breast cancer sub-types may be useful to further characterise the genetic basis of breast cancer and contribute to the identification of some of the ‘missing heritability’.

The WTCCC non-synonymous SNP panel has, understandably, limited genome coverage even after imputation of SNPs. Perhaps remarkably, a meta-analysis of 1200 SNPs known to be associated with diseases found that for 40% of SNPs there was no association with known exonic sequences.32 At least five of the regions associated with breast cancer are non-genic, and other associated variants within genes are known to be intronic (for example, the FGFR2 association).

Although only the FGFR2 gene achieves genome-wide significance in the combined sample after correction for the number of regions tested, we note that the high ranked regions include the PIK3AP1 gene, which is a promising candidate for further study. We also present evidence of association for the FGFR2 gene and the 8q24 region with early-onset disease, despite the relatively small number of early-onset (POSH) cases studied. Early-onset cancers include a greater proportion of estrogen receptor negative (ER−) tumours but most genome-wide association studies undertaken so far have focussed on later onset disease and have had greater power to detect genes associated with ER+ tumours.9 The evidence suggests that the associations at loci such as FGFR2 and 8q24 are stronger for ER+ tumours and there is reportedly greater FGFR2 expression in ER+ cell lines. Genetic analysis of larger samples of early-onset cases, stratified by tumour sub-types, is essential to fully comprehend the heterogeneity in phenotype-genotype associations and the degree to which early and late-onset disease may have different genetic backgrounds.