Breast cancer is the most common cancer in women worldwide and women in the UK have a one out of 10 lifetime risk of developing the disease. First-degree female relatives of breast cancer patients have an approximately two-fold increased risk over the general population, but less than 25% of this excess risk is explained by inherited mutations in known high penetrance breast cancer susceptibility genes, such as BRCA1 and BRCA2 (Antoniou et al, 2001). Data from large multiple case families suggest that there will be few other high penetrance genes. It is more plausible that there are multiple common low risk (low penetrance) genetic variants, which are associated with relatively small effects on risk in the individual, but contribute substantially to the overall risk in the population (Antoniou et al, 2002).

Controlling the progression of cells into and through S phase of the cell cycle is important in regulating DNA synthesis and thus cell proliferation. The retinoblastoma protein (pRb) is critical for regulating not only progression of cells from G1 into S phase, but also progression of cells through S phase (Weinberg, 1995; Knudsen et al, 1998; Niculescu et al, 1998). It acts as a negative regulator of cellular proliferation by sequestering a variety of nuclear proteins involved in cellular growth that are released when pRb is phosphorylated. The gene RB1, encoding the protein, was mapped to chromosome 13q14.12–13q14.2 in children who developed retinoblastoma, a rare cancer of the eye, and was the first tumour-suppressor gene to be cloned (Friend et al, 1986). It consists of 27 exons that are distributed over 180 kb. Mutations are spread across the gene and approximately 80% of patients with hereditary mutations have bilateral disease. Hereditary retinoblastoma patients are at risk of developing and dying of second primary cancers in childhood and adolescence (Francois et al, 1980; Draper et al, 1986; Lueder et al, 1986; DerKinderen et al, 1988; Olsen et al, 1990) and excess mortality from second malignancies in retinoblastoma survivors was found to persist during long-term follow-up into adulthood. Female patients have a higher mortality from second tumours (RR=39) than males (RR=22) (Eng et al, 1993). Germline mutations in specific codons or regions of the RB1 gene could therefore predispose to the development of a second tumour. Subsequent studies have shown somatic mutation of RB1 in a variety of cancers, including sarcomas, breast cancer, lung cancer and genitourinary cancers (Benedict et al, 1988; T'Ang et al, 1988; Hensel et al, 1990; Sasano et al, 1990). Common variants in RB1 are therefore candidate for low to moderate risk breast cancer alleles.

Association studies, using very large sets of affected cases and suitably selected controls, are considered to be the most powerful method for finding common low penetrance disease susceptibility genes. The aim of this study was to test the hypothesis that one or more variants in the gene is associated with breast cancer using a single-nucleotide polymorphism (SNP) tagging approach in a large, British breast cancer case–control study. In order to have good power to detect small relative risks we have restricted our attention to common SNPs and haplotypes (frequency 5%).

Patients, materials and methods

Patients and controls

Cases were drawn from SEARCH (breast), an ongoing population based study, with cases ascertained through the East Anglian Cancer Registry. All patients diagnosed with invasive breast cancer below age 55 years since 1991 and still alive in 1996 (prevalent cases, median age 48 years), together with all those diagnosed <70 years between 1996 and the present (incident cases, median age 54 years) were eligible to take part. In all, 67% of eligible breast cancer patients returned a questionnaire and 64% provided a blood sample for DNA analysis. Controls were randomly selected from the Norfolk component of European Prospective Investigation of Cancer (EPIC). European Prospective Investigation of Cancer is a prospective study of diet and cancer being carried out in nine European countries. The EPIC-Norfolk cohort comprises 25 000 individuals resident in Norfolk, East Anglia) – the same region from which the cases have been recruited. Controls were not matched to cases, but were broadly similar in age, being aged 42–81 years old at blood draw (median age 63 years). The ethnic background of both cases and controls as reported on the questionnaires was similar, with >98% being white. The study was approved by the Eastern Region Multicentre Research Ethics Committee, and all patients gave written informed consent.

The total number of cases available for analysis was 4474 of whom 27% were prevalent cases. The samples have been split into two sets in order to save DNA and reduce genotyping costs: the first set (n=2271 cases and 2280 controls) is genotyped for all SNPs and the second set (n=2203 cases and 2280 controls) is then tested for those SNPs that show marginally significant associations in set 1 (P-heterogeneity or P-trend <0.1). This staged approach substantially reduces genotyping costs without significantly affecting statistical power. Cases were randomly selected for set 1 from the first 3500 recruited, with set 2 comprising the remainder of these plus the next 974 incident cases recruited. As the prevalent cases were recruited first, the proportion of prevalent cases was somewhat higher in set 1 than set 2 (33 vs 20%). Median age at diagnosis was similar in both sets (51 and 52 years old, respectively). There was no significant difference in the morphology, histopathological grade or clinical stage of the cases by set or by prevalent/incident status.

Identification of SNPs

Single-nucleotide polymorphisms were initially identified through the following SNP databases: ENSEMBL,, dbSNP,, and The RB1 gene mutation database, (Lohmann, 1999). Eleven SNPs, encompassing the RB1 gene and with a reported frequency 5% in the Caucasian population according to the public databases were initially examined in a set of 96 individuals from the EPIC-Norfolk population in order to confirm their presence in the British population (Table 1) and to estimate pairwise correlation coefficients (rp2). Two SNPs, rs198610 and rs198580 were subsequently found to have a frequency lower than 5% in our East Anglian sample set. Strong linkage disequilibrium (LD) across the RB1 gene was observed (illustrated by r2 values shown in Table 1).

Table 1 Selection of SNPs across RB1

During the course of the study the NIEHS EGP Project ( released resequencing data for the coding sequence of RB1 gene. This represents 38% of the genomic sequence. Data were available for a panel of 90 individuals representative of US ethnicities: including 24 European Americans, 24 African Americans, 12 Mexican Americans, 6 Native Americans and 24 Asian Americans (PDR90) (Livingston et al, 2004). It is known that there is greater genetic diversity in individuals of African origin but ethnic group identifiers for the PDR90 samples are not available. We identified 28 of the samples most likely to be African-American in this population by comparing the genotypes for the PDR90 samples with the genotypes for the same SNPs from the National Heart, Lung, and Blood Institute Variation Discovery Resource project African-American panel. Data from the remaining 62 individuals were used to identify a set of tagging SNPs (stSNPs). Of 279 SNPs identified in the PDR90 samples, only 25 are likely to have a frequency 0.05 in Caucasians (Table 1).

We used the programme Tagger to select a set of SNPs to tag all the known common variants (Paul de Bakker, Tagger uses a strategy that combines the simplicity of pairwise methods with the potential efficiency of multimarker approaches. It begins by selecting a minimal set of markers such that all alleles to be captured are correlated at an rp2 greater than 0.8 with a marker in that set. It then tries to capture SNPs which could not be captured in the pairwise step using multimarker tests constructed from the set of markers chosen as pairwise tags.

Four of the SNPs we had initially chosen were not present in the PDR90 data set (all in intron 17). The remaining seven were forced in as tagging SNPs. An additional eight tagging SNPs were chosen, but an assays could not be designed for two of these, neither of which had alternative tags. Another tSNP was found not to be polymorphic in our population. Thus, 12 tSNPs were genotyped in our case–control sample. These tagged 22 out of 25 SNPs with rp2>0.8 and one SNP (rs1573601) was tagged by a three SNP haplotype combination. The SNPs which failed assay design were tagged with rp2=0.19 and rp2=0.28. The contribution of the four additional SNPs identified through public databases to the tagging of other known variants could not be estimated.


We genotyped all samples for the 15 SNPs using the ABI PRISM 7900 sequence detection system or ‘Taqman’ (Applied Biosystems, Foster City, CA, USA). Forward and reverse primers, and FAM and VIC labelled probes were designed by Applied Biosystems (ABI Assays-by-Design or ABI Assays-on-Demand). Sequences for primers and probes are available on request. We carried out PCR on DNA (10 ng) using TaqMan universal PCR master mix (Applied Biosystems) in a 5 μl reaction. Amplification conditions on MJ Tetrad thermal cyclers (GRI) were as follows: one cycle of 95°C for 10 min, followed by 40 cycles of 95°C for 15 s and 60°C for 1 min. We read the completed PCRs on an ABI PRISM 7900 Sequence Detector in end point mode using the Allelic Discrimination Sequence Detector Software (Applied Biosystems). For the software to recognise the genotypes, we included two nontemplate controls in each 384-well plate. For set 1 and set 2, cases and controls were arrayed together in twelve 384-well plates and a thirteenth plate contained eight duplicate samples from each of the twelve plates to insure a good quality of genotyping. For each SNP, failed genotypes were not repeated.

Statistical methods

For each polymorphism, deviation of the genotype frequencies from those expected under Hardy–Weinberg equilibrium was assessed in the controls by χ2 tests. Genotype frequencies in cases and controls were compared by χ2 tests (P-heterogeneity, 2 d.f.). We also tested for an allele dose effect assuming a multiplicative codominant model using unconditional logistic regression (P-trend, 1 d.f.). The genotypic specific risks were estimated as odds ratios (ORs) with associated 95% confidence limits. For SNPs that were significant at the 5% level we also compared the fit of dominant and recessive models with the codominant model by combining the appropriate genotype categories.

In addition to the univariate analyses we carried out global haplotype test and a specific haplotype test for a three SNP haplotype that tagged a common variant. Haplotype frequencies and subject-specific expected haplotype indicators were calculated separately using the programme TagSNPs, which implements an expectation-substitution approach to account for haplotype uncertainty given unphased genotype data (Stram et al, 2003a, 2003b). Subjects missing more than 50% genotype data were excluded from haplotype analysis. We considered haplotypes with greater than 4% frequency in either cases or controls to be ‘common’. Rare haplotypes were pooled. We used unconditional logistic regression to test the global null hypothesis of no association between haplotype frequency and breast cancer, by comparing a model with multiplicative effects for each common haplotype (treating the most common haplotype as referent) to the intercept-only model. Haplotype-specific ORs were also estimated with their associated confidence intervals.

Screening of the P2RY5 gene

We looked for the presence of polymorphisms with rare allele frequency 5% in the P2RY5 gene (RefSeq NT_024524) in our population by sequencing a set of 48 genomic DNA samples from the UK breast cancer patients. Sequencing was performed on the ABI Prism 3100 Capillary DNA Sequencer (Applied Biosystems) according to the manufacturer recommendations. The pairs of primers used for the sequencing of the P2RY5 gene are available from authors on request.


Association analysis of SNPs

Genotype distributions in the controls did not differ significantly from those expected under Hardy–Weinberg equilibrium for any of the SNPs. Of the 15 SNPs, 12 were genotyped only in set 1. SNPs rs2854344 and rs4151611 were tested in set 2 because they met the threshold for significance (see methods) and rs198580 was also tested in set 2 because we had observed a borderline association for this SNP with another disease phenotype (data not shown).

There was no significant difference in genotype frequencies between cases and controls for SNPs rs1981434, rs2854345, rs520342, rs399413, rs4151540, rs198610, rs2227311, rs9535032, rs425834, rs4151611, rs4151620, rs3092904 and rs4151636 (Table 2). For two SNPs, rs2854344 and rs198580, unadjusted P-values for comparison of genotype distribution between cases and controls were below the 5% level (P-heterogeneity=0.02 and 0.03 and P-trend=0.007 and 0.018, respectively). Genotype-specific risks for all tagging SNPs are in Table 2. There was no association in controls between age and genotype frequency for any of the SNPs, and age-adjusted genotype-specific ORs were similar to the unadjusted ORs (data not shown). The minor allele of rs2854344 appeared to confer a reduced risk of disease with a dominant model fitting the data slightly better than the codominant one (P=0.006 vs 0.007). The dominant model also fit the data best for rs198580 (P=0.01 vs 0.02), also with a protective effect of the minor allele, even though the risk estimate for rare homozygous individuals was 1.4 (0.31–6.24).

Table 2 SNPs genotyped in the study set

One SNP, rs1573601, was tagged by the haplotype consisting of the three rare alleles of rs1981434, rs4151540 and rs3092904. The frequency of this haplotype was similar in cases (0.25) and in controls (0.25) (P=0.94).

Haplotype analysis

As the complete gene was not resequenced, it is possible that important functional variants that have not been tagged by the 15 SNPs genotyped could have been missed. We therefore carried out a comparison of common haplotype frequencies in cases and controls in addition to the univariate and specific three SNP haplotype analyses. There were five common haplotypes which accounted for 84% of all haplotypes in the control population (Table 3). We found no evidence of differences in common haplotype frequencies between cases and controls (P=0.08, 5df). Table 3 shows the haplotype-specific ORs, none of which differed significantly from the unity.

Table 3 RB1 haplotype analysis using the 15 SNPs genotyped in the study set

Exclusion of the P2RY5 gene

The SNP rs2854344 lies in intron 17 of RB1, which at 72 kb is the largest intron of the gene. The intron contains an open reading frame encoding the G protein-coupled receptor P2RY5 (Purinergic Receptor P2Y, G-protein coupled, 5) in the reverse orientation relative to the transcription of RB1 (Herzog et al, 1996). The P2RY5 gene consists of only one coding exon and rs2854344 lies 11kb 5′ of this exon. Various bioinformatic tools (NIX, Nucleotide Identify X software,; PupaSNP, suggest that the variant rs2854344 (and the variant rs198580 in intron 19 of RB1) does not have any functional effect, or alter dramatically the structure of RB1 or of P2RY5 (data not shown). Therefore, causal variant(s) in LD with rs2854344 and rs198580 could be located within RB1 or within P2RY5. In order to investigate a possible association with variants within P2RY5, we sequenced the unique exon of P2RY5 and 500 bp of its flanking sequences in 5′ and 3′ in a panel of 48 controls. No coding SNP was identified. The nearest polymorphisms were rs2227311, located 473 bp upstream, and rs4151551, located 86 bp downstream of P2RY5. No association with breast cancer risk was found with rs2227311 (Table 2), and rs4151551 was subsequently tested in set1. No association with breast cancer risk was found with rs4151551 (P-heterogeneity=0.74). We conclude that, if the observed associations are real, it is likely that it is variation related to RB1 and not P2RY5 that modifies breast cancer susceptibility.


The case–control study design is well suited to the identification of small-effect genes that are likely to underlie common, complex diseases such as breast cancer (Risch, 2000). Two approaches have been proposed. The traditional, hypothesis-driven approach is to investigate SNPs on the basis of their putative biological relevance, in particular SNPs in coding regions, as they are more likely to influence directly the traits under study (Tabor et al, 2002). Alternatively, when many markers, both coding and noncoding, are available in a gene, it may be more efficient to select only tagging SNPs, that is, SNPs that capture the majority of the genetic variation of the gene (Johnson et al, 2001; Stram et al, 2003a, 2003b; Thompson et al, 2003). There are no common coding SNPs in the RB1 gene and since the regulatory SNPs have not yet been characterised we chose the indirect approach, which allows detection of association between a particular genomic region and the disease, whether or not the SNPs themselves have a functional effect (Gabriel et al, 2002; Cardon and Abecasis, 2003; Zondervan and Cardon, 2004). We are confident that our set of selected markers provides enough information about the remainder of the common SNPs in the gene, and any unknown common variants will either be tagged by the stSNPs or by the common haplotypes that they generate (Haiman et al, 2003; The International HAPMAP Consortium, 2003; Carlson et al, 2004). It is worth noting that the two SNPs that were associated with breast cancer were not identified in the EGP resequencing data, and would have been missed if only these data had been used to select tSNPs.

We found that the minor alleles of two SNPs, rs2854344 and rs198580 were associated with breast cancer susceptibility at a nominal significance level of 0.05. However, we have tested 15 SNPs for association and the possibility that the findings are the result of a Type I statistical error should not be discounted. Standard adjustments for multiple hypothesis testing, such as the Bonferroni correction, are too conservative, as they assume that the tests are independent. We therefore used permutation testing by randomly shuffling the case–control status to obtain an empirical adjusted P-value for the most significant association detected in the primary tests of association (i.e. P-trend=0.007). In 1000 random permutations, a P-value at least as significant as this was obtained on 68 occasions, giving a P-trend adjusted for multiple testing of 0.068 for the association of rs2854344 with breast cancer. Thus, the observed association is of borderline significance.

An alternative explanation for the observed results is confounding due to hidden population stratification. This occurs when allele frequencies differ between population subgroups and cases and controls are drawn differentially from those subgroups. However, it seems unlikely that population stratification is relevant here because the cases and controls were drawn from the same ethnic groups (both >98% of northwestern European ancestry). Furthermore, we have found no evidence for association between pairs of 64 unlinked markers (2016 tests) in the controls, which suggests that there is unlikely to be significant substructure in our population (Goode et al, 2005).

Assuming the results to be real, it may either be due to a direct causative effect of the SNPs tested, or it may be because they are markers for other functional variants. The associated SNPs lie in intron 17 and intron 19 of RB1, and it seems unlikely that either of them has direct functional effects. pRb undergoes cell-cycle-dependent phosphorylation during G1, and this modifies its interaction with at least some members of the E2F family, which regulates the transcription of many genes required for S phase (Kaelin et al, 1992). The finding of protective RB1 alleles was unexpected, as deletions or inactivating mutations of the gene observed in tumours generally lead to an absence of negative control of the protein on the cell proliferation. No coding variants were identified during EGP resequencing and so it is plausible, but unlikely that the presence of the still unidentified common variant prevents the protein from appropriate dephosphorylation. However, only a small number of subjects were resequenced and it is possible that the observed association is due to correlation with an unidentified rare coding variant that was not present in the resequenced samples. Nor have we been able to identify any common variants in P2RY5, the gene within intron 17 of RB1 that might explain the association. Again rare variants cannot be excluded. The more likely hypothesis is the presence of a SNP in the promoter or in a regulatory element that affects the level of pRb. It is possible that the causal variant might induce higher levels of pRb or expression at critical times than the common allele.

Further studies would be required to identify and investigate the mechanism of action of a causative variant. However, before such studies are contemplated, these putative associations need to be tested and confirmed in independent breast cancer studies. It would also be interesting to check the involvement of the protective alleles in other cancer types, in particular in melanoma, and in bone, connective tissue, ovarian and uterine cancer which are also cause of death in retinoblastoma long term survivors (Eng et al, 1993).