Introduction

The data generated by the HapMap project have determined the common patterns of DNA sequence variation in the human genome from populations across four geographical regions.1, 2 This information is providing an unprecedented view of human genetic diversity that is used primarily in association studies but will give insights into many other areas of research such as studies of linkage disequilibrium, haplotype block distributions, the localisation of recombination hotspots, effects of natural selection and how these have shaped human genetic variation. In addition, the scientific community now has access to a recently released draft of the chimpanzee genome (Pan troglodytes).3 At nearly all SNP (single-nucleotide polymorphism) locations in human genes, chimps have a nucleotide identical to one of the human nucleotides, which means that our common ancestor almost certainly had the same nucleotide. The search for ancestral and derived nucleotides has recently been the object of attention in the scientific community and may uncover ‘footprints’ of positive selection that have occurred recently in humans and may explain different susceptibilities to disease. One example of this is the work of Puente et al,4 who have suggested that small differences in cancer genes might influence the difference in cancer susceptibility between the two species.

Although some reviews have reported linkage disequilibrium (LD) extending over distances greater than 100 kb,5, 6, 7, 8 the average length of LD blocks in European populations is about 22 kb, with at least 50% of the European human genome existing in blocks of around 44 kb.9 It has also been suggested that some of these regions of extended LD may play an important role in determining the genetic bases of human phenotypic differences.10 Regions of LD are characterised by strong association between alleles, low haplotype diversity and low recombination rates.11 In addition, some of the larger LD blocks have recently been associated with positive selection through human evolution.8 Several authors have reported that regions with limited haplotype diversity have at least one pair of high-frequency haplotypes composed of completely mismatching SNP alleles, also referred to as a Yin–Yang pair, and these pairs are suspected to be of very ancient origin.12, 13

We have recently reported that only 12% of a set of cancer-related genes contained at least one LD block larger than 60 kb.14 In the present study, we aimed to further test whether such genes with longer LD blocks in the Spanish population were subject to some sort of selection and make some contribution to disease aetiology. First, we examined whether 20 cancer-related genes with LD blocks larger than 60 kb fell into any particular category of function. Second, we studied the haplotype block structure in each of the genes, including the frequency distribution, the presence of Yin–Yang pairs, and whether the ancestral haplotype was present in Spanish controls, and then compared all these factors across the four sampled HapMap populations (CEU, YRI, JPT and CHB). Third, we looked for positive selection and, finally, we studied whether these genes were associated with breast cancer by comparing their haplotype frequency distributions among Spanish breast cancer cases and controls.

Materials and methods

Study population

The recruitment of cases and controls has been previously described.15 Briefly, cases were 864 women with breast cancer and mean age at diagnosis of 50 years (range: 23–86 years) recruited between 2000 and 2004. Of these, 574 were consecutively recruited via three public hospitals in Spain (Hospital La Paz, the Fundación Jiménez Díaz and Hospital Monte Naranco), while 290 were cases attending our family cancer clinic for genetic testing who had at least one affected first-degree relative. Controls were 845 Spanish women free of breast cancer at ages ranging from 23 to 86 years (mean=53 years), recruited between 2000 and 2005 via the following sources: the Menopause Research Centre at the Instituto Palacios; the College of Lawyers; the National Blood Transfusion Centre; the Catalan Institute of Oncology (ICO); and the Centre for the Investigation of Cancer (CIC). Informed consent was obtained from all participants, and the study was approved by the Institutional Review Board of Hospital La Paz, Madrid.

Candidate gene choice, SNP selection and haplotype analysis

The 121 genes and SNPs were selected according to previously published criteria:14, 15 genes previously reported to be associated with or known to be involved in cancer; genes involved in cell cycle pathways; DNA repair; cell communication; hormone metabolism; apoptosis; carcinogen metabolism; cell adhesion; cell proliferation and differentiation; nucleoside, nucleotide and nucleic acid metabolism; oncogenesis; developmental processes; and/or signal transduction. The main criterion for SNP selection was marker density as a function of LD with priority given to tag-SNPs defining common haplotypes.15 The 20 genes with LD blocks larger than 60 kb and their corresponding SNPs studied are detailed in Supplementary Table 1. The final average SNP density was one SNP for every 8.7 kb.

Genotyping

Genomic DNA from subjects was isolated from peripheral blood lymphocytes using automatic DNA extraction (Magnapure; Roche, Mannheim, Germany) according to the manufacturer's recommended protocols. This DNA was quantified using PicoGreen and diluted to a final concentration of 50 ng/μl for genotyping.

Genotyping of SNPs was carried out using the Illumina Bead Array System (Illumina Inc., San Diego, CA, USA) according to the manufacturer's protocols.16 At least one duplicate and one negative control were included per 96-well plate, and six samples were duplicated across plates. The total number of duplicates across all plates was 35 (15 cases, 17 controls and a nonstudy child–parents' triad).

Assignment of ancestral alleles

We obtained FASTA sequences surrounding each SNP from the dbSNP database (build 35 of the human genome) and aligned them to the draft build of the chimp genome sequence (http://genome.ucsc.edu/cgi-bin/hgBlat). For each SNP, we selected the best overall alignment, preferring alignments mapping to a unique chimp chromosome. We then inferred the ancestral state as the chimp allele at the corresponding position in the sequence, provided that the sequence quality score was greater than 20 at that site and that it matched one of the human alleles.
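The calling rule described above can be sketched as follows. This is only an illustration of the two stated conditions (quality score above 20, chimp base matching one of the human alleles); the function name and argument layout are hypothetical, and the alignment step itself is not reproduced.

```python
from typing import Optional, Tuple

# Threshold taken from the text: the chimp quality score must exceed 20.
QUALITY_THRESHOLD = 20


def infer_ancestral_allele(
    human_alleles: Tuple[str, str],
    chimp_base: str,
    chimp_quality: int,
) -> Optional[str]:
    """Return the inferred ancestral allele, or None if no call can be made.

    Hypothetical helper: assumes the chimp base and its quality score have
    already been extracted from the best alignment for this SNP.
    """
    base = chimp_base.upper()
    # Condition 1: the chimp sequence must be of sufficient quality.
    if chimp_quality <= QUALITY_THRESHOLD:
        return None
    # Condition 2: the chimp base must match one of the two human alleles.
    if base not in {allele.upper() for allele in human_alleles}:
        return None
    return base
```

For example, `infer_ancestral_allele(("A", "G"), "g", 35)` yields `"G"`, whereas a chimp base of `"T"` or a quality score of exactly 20 yields no call.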

Block definition and haplotype distribution

The LD blocks within genes were determined among Spanish controls using an R2 threshold of 0.8 (Haploview v3.1.1).17 Haplotypes (the combinations of variants along chromosomes) were inferred using PHASE 2.1. Haplotype blocks determined in the Spanish controls were applied to all four HapMap samples (CEU, YRI, JPT and CHB), and further analyses were restricted to SNPs in these blocks. The LD structure of each of the 20 genes with LD blocks larger than 60 kb is shown in Supplementary Figure 1, and a full list of these genes and selected SNPs is provided in Supplementary Table 1.
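For reference, the pairwise R2 statistic underlying the 0.8 threshold can be computed from phased two-SNP haplotypes as D²/(pA(1−pA)pB(1−pB)), where D = pAB − pA·pB. The sketch below is a minimal illustration with a hypothetical function name; it is not Haploview's implementation, which also handles unphased genotypes and block definition.

```python
from typing import Sequence, Tuple


def pairwise_r2(haplotypes: Sequence[Tuple[str, str]]) -> float:
    """Squared correlation (r^2) between two biallelic SNPs.

    `haplotypes` is a list of phased (allele_at_snp1, allele_at_snp2) pairs,
    one per chromosome. Hypothetical helper for illustration only.
    """
    if not haplotypes:
        raise ValueError("at least one haplotype is required")
    n = len(haplotypes)
    # Use the alleles on the first chromosome as the reference alleles.
    a_ref, b_ref = haplotypes[0]
    p_a = sum(h[0] == a_ref for h in haplotypes) / n
    p_b = sum(h[1] == b_ref for h in haplotypes) / n
    p_ab = sum(h[0] == a_ref and h[1] == b_ref for h in haplotypes) / n
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic")
    return d * d / denom
```

Two SNPs whose alleles always travel together give an r^2 of 1, while SNPs whose four allele combinations are equally frequent give 0.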

We identified Yin–Yang haplotype pairs within LD blocks according to the following criteria: at least five SNPs, each with a minor allele frequency (MAF) of at least 10%, or fewer SNPs meeting this frequency criterion but spanning more than 22 kb; and the least frequent of the Yin–Yang haplotype pair having a frequency greater than 3%.13 The ancestral haplotype was inferred for each LD block by combining the ancestral alleles of the SNPs in that block.
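These criteria can be expressed compactly in code. The sketch below is illustrative: the function names are hypothetical, haplotypes are assumed to be strings of alleles over the same ordered SNPs, and the tie-breaking rule (preferring the mismatching pair with the highest combined frequency) is our assumption rather than something stated in the criteria.

```python
from typing import Dict, Optional, Tuple


def block_qualifies(n_common_snps: int, span_kb: float) -> bool:
    """Block-level criterion: >=5 SNPs with MAF >= 10%, or fewer spanning > 22 kb."""
    return n_common_snps >= 5 or (n_common_snps > 0 and span_kb > 22.0)


def is_complete_mismatch(hap_a: str, hap_b: str) -> bool:
    """True if the two haplotypes differ at every SNP position."""
    return len(hap_a) == len(hap_b) and all(x != y for x, y in zip(hap_a, hap_b))


def find_yin_yang(
    freqs: Dict[str, float], min_freq: float = 0.03
) -> Optional[Tuple[str, str]]:
    """Return (yin, yang), with yin the more frequent member, or None.

    Only haplotypes with frequency > min_freq (3% in the text) are considered.
    Assumption: if several mismatching pairs exist, keep the one with the
    highest combined frequency.
    """
    common = sorted(
        (h for h, f in freqs.items() if f > min_freq), key=lambda h: -freqs[h]
    )
    best: Optional[Tuple[str, str]] = None
    for i, hap_a in enumerate(common):
        for hap_b in common[i + 1:]:
            if not is_complete_mismatch(hap_a, hap_b):
                continue
            combined = freqs[hap_a] + freqs[hap_b]
            if best is None or combined > freqs[best[0]] + freqs[best[1]]:
                best = (hap_a, hap_b)
    return best


def ancestral_haplotype(ancestral_alleles: Tuple[str, ...]) -> str:
    """Concatenate per-SNP ancestral alleles into the block's ancestral haplotype."""
    return "".join(ancestral_alleles)
```

With made-up frequencies `{"ACGTA": 0.45, "GTACG": 0.25, "ACGTG": 0.20, "GTACA": 0.02}`, the pair returned is `("ACGTA", "GTACG")`; the rare haplotype `GTACA` is excluded by the 3% threshold.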

Statistical analysis (haplotype association study)

Deviations from Hardy–Weinberg equilibrium were tested using the genhwi command in STATAv8.0.18 Differences in the haplotype distributions between cases and controls were tested using the χ2-test. PHASEv2.1 software19, 20 was used to impute haplotypes and compare their frequency distributions in cases and controls. Odds ratios (ORs), their 95% confidence intervals (CIs) and Wald's statistic P-values were estimated, via unconditional logistic regression (STATAv8.0), for haplotypes with frequency greater than 0.01, using the most frequent haplotype among controls as reference and assuming, for each subject, that the most likely imputed haplotypes were observed. Analysis of haplotypes was repeated using the haplo.stats library implemented in R, which compares haplotype frequencies in cases and controls in an unbiased way by including haplotype uncertainty in the estimation of ORs.
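As an illustration of the OR estimates, the sketch below computes the odds ratio and Wald 95% CI for one haplotype against the reference haplotype from a 2 × 2 table of counts. It is a simplified stand-in for the logistic regression fitted in STATA: for a single haplotype indicator without covariates, the regression OR equals the cross-product ratio and the Wald standard error of log(OR) is sqrt(1/a + 1/b + 1/c + 1/d). The function name is hypothetical, and treating chromosomes as the unit of counting is an assumption.

```python
import math
from typing import Tuple


def odds_ratio_wald(
    a: int, b: int, c: int, d: int, z: float = 1.959964
) -> Tuple[float, float, float]:
    """Odds ratio with Wald confidence interval from a 2x2 table.

    a: cases carrying the test haplotype
    b: cases carrying the reference haplotype
    c: controls carrying the test haplotype
    d: controls carrying the reference haplotype
    z: normal quantile (1.959964 gives a 95% interval)
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cell counts must be positive")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the usual large-sample approximation.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper
```

With illustrative counts `odds_ratio_wald(20, 80, 10, 90)`, the point estimate is 2.25. Note that this sketch treats the most likely imputed haplotypes as observed and so ignores phase uncertainty, which is what the haplo.stats analysis is intended to account for.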

HapMap project data

We used Phase I data from the HapMap project, which comprises samples of Utah residents with ancestry from northern and western Europe (CEU); Han Chinese in Beijing, China (CHB); Japanese in Tokyo, Japan (JPT); and Yoruba in Ibadan, Nigeria (YRI). In some analyses, the CHB and JPT samples have been pooled and are referred to as the ASN sample. Haplotype blocks determined in the Spanish controls were applied to all four HapMap samples (CEU, YRI, JPT and CHB). Haplotype phase estimation for all the data was performed by the HapMap consortium using Phase 2.0. The phasing procedure also imputed all missing genotypes at SNPs with less than 20% missing data.

Gene ontology analysis

Genes were classified into Gene Ontology (GO) categories21 using DAVID.22 Differences in the frequency of GO categories between the 20 genes with long LD blocks and the other 101 cancer-related genes were tested using Fisher's exact test.
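For readers unfamiliar with the test, a two-sided Fisher's exact test on a 2 × 2 table (genes in or out of a GO category, by long-LD status) can be sketched with the standard library alone. The function name is hypothetical, and the two-sided definition used here (summing all tables no more probable than the observed one) is one common convention; we do not know which convention the software used in the study applied.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact P-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins whose probability does not exceed that of the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(x: int) -> float:
        # Probability that the top-left cell equals x, given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )
```

As a check on the logic, the classic table `[[3, 1], [1, 3]]` gives 34/70, or about 0.486.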

Recent positive selection

We assessed evidence of recent positive selection in the 20 candidate genes using the online browser Haplotter23 (http://pritch.bsd.uchicago.edu/data.html). The web page displays the results for positive selection in genes or genomic regions of the human genome using the HapMap data. This program provides plots of two parameters: iHS, the integrated haplotype score, which measures positive selection on the ancestral and derived alleles via the decay of extended haplotype homozygosity24 and Fst, a measure of the degree of population differentiation based on pairwise SNP frequency comparisons. Haplotter also identifies iHS and Fst scores considered to be statistically significant.
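To make the Fst measure concrete, the sketch below computes a simple Wright-style Fst for one biallelic SNP between two populations as (HT − HS)/HT, where HT is the expected heterozygosity at the pooled allele frequency and HS is the mean within-population expected heterozygosity. This is an illustrative textbook form with equal population weights and a hypothetical function name; it is an assumption for exposition and not necessarily the estimator implemented in Haplotter.

```python
def fst_two_populations(p1: float, p2: float) -> float:
    """Simple Fst for one biallelic SNP from allele frequencies p1 and p2.

    Assumes equal weighting of the two populations and no sample-size
    correction (illustrative only).
    """
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)  # heterozygosity of the pooled population
    if h_total == 0:
        return 0.0  # SNP is monomorphic in both populations
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    return (h_total - h_within) / h_total
```

Identical frequencies in both populations give 0, fixed differences give 1, and frequencies of 0.2 and 0.8 give 0.36.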

Results

A total of 191 SNPs were successfully genotyped, and of those, a final 159 SNPs (83%) with a MAF of at least 10% were included in the haplotype analysis of the 20 genes with LD blocks larger than 60 kb. We observed a total of 21 blocks complying with these criteria, two in CDK6 and one in each of the other genes. These 20 genes are located across 12 different chromosomes (2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13 and 17), have an average size of 101 kb (range: 61–199 kb) and together span 2.2 Mb of the genome (Table 1). The average density of SNPs genotyped in these genes with an allele frequency higher than 10% was one SNP for every 9.7 kb. The LD structure of each gene is shown in Supplementary Figure 1, and a full list of these 20 genes and selected SNPs is provided in Supplementary Table 1.

Table 1 Summary of the haplotype data in the 20 candidate genes

To understand more about these 20 genes with large LD blocks, we looked into their Gene Ontology (GO) classifications,21, 25 and observed that they cover a broad range of biological processes. Nevertheless, the most overrepresented category was DNA repair with eight (ATM, BRCA1, BRCA2, ERCC6, MLH1, MSH3, RAD54B and XRCC4) of the 20 genes with long LD blocks classified as being involved in ‘regulation of nucleobase, nucleoside, nucleotide and nucleic acid metabolism’ vs none of the 101 other genes. This difference was highly statistically significant (unadjusted P=1.2 × 10−6), even after the most conservative corrections for multiple testing (P=2.8 × 10−4, assuming all 228 categories observed are independent).
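The multiple-testing adjustment can be illustrated with a Bonferroni-style correction, in which the unadjusted P-value is multiplied by the number of categories tested. Naming the correction as Bonferroni is our assumption; the text states only that it is the most conservative correction over 228 categories. The function name is hypothetical.

```python
def bonferroni(p_value: float, n_tests: int) -> float:
    """Bonferroni-adjusted P-value, capped at 1."""
    return min(1.0, p_value * n_tests)


# Rounded unadjusted P-value and number of GO categories from the text.
adjusted = bonferroni(1.2e-6, 228)
```

With the rounded input of 1.2 × 10−6, the product is approximately 2.7 × 10−4, close to the reported 2.8 × 10−4; the small difference presumably reflects rounding of the unadjusted value.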

Haplotype structure

We observed a very high correlation in haplotype frequencies between the Spanish control sample (N=845) and the HapMap CEU sample (R2=0.96, Supplementary Figure 2), as previously reported,14 and further comparisons were therefore not made between these two samples. The correlation was much lower when the Spanish sample was compared with the other three HapMap populations: R2=0.38 for JPT, 0.32 for CHB and 0.038 for YRI (Supplementary Figure 2).
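A correlation of this kind can be sketched as the squared Pearson correlation between the frequencies of each haplotype in two samples. In the sketch below, the function name is hypothetical, and treating a haplotype absent from one sample as having frequency 0 is our assumption about how unshared haplotypes are handled.

```python
from typing import Dict


def frequency_r2(freq_a: Dict[str, float], freq_b: Dict[str, float]) -> float:
    """Squared Pearson correlation of haplotype frequencies in two samples.

    Haplotypes missing from one sample are assigned frequency 0 (assumption).
    """
    haplotypes = sorted(set(freq_a) | set(freq_b))
    xs = [freq_a.get(h, 0.0) for h in haplotypes]
    ys = [freq_b.get(h, 0.0) for h in haplotypes]
    n = len(haplotypes)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("frequencies must vary within each sample")
    return s_xy * s_xy / (s_xx * s_yy)
```

Two samples with identical haplotype frequencies give a value of 1; the value falls as population-specific haplotypes and frequency differences accumulate.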

An average of three haplotypes per block represented 89% of all haplotypes with frequencies greater than or equal to 10% (Table 1). We compared the individual haplotypes observed in the Spanish, European – CEU, Asian and Yoruban samples considering only those observed in at least one sample with a frequency greater than 5% (Figure 1).

Figure 1

Distribution of haplotypes with frequency >5% across populations.

The ancestral allele for each of the SNPs in this study is shown in Supplementary Table 1. We identified 14 ancestral haplotypes among the 21 LD block fragments (67%) present in either the Spanish sample or the HapMap samples. The average frequency of this ancestral haplotype, when observed, was 15% (SD=6.8%) among Spanish controls, and varied in the three non-CEU HapMap samples, being 36% (SD=25%) in YRI, 36% (SD=25%) in CHB and 32% (SD=26%) in JPT. This information is summarised in Table 2a and detailed information highlighted in Supplementary Table 2.

Table 2 Summary information for each LD block regarding the ancestral haplotype (based on the chimpanzee sequence) (a) and the most common haplotype in Spaniards and its frequency in the other populations of HapMap (b).

The average frequency of the most common haplotype in each block in the Spanish control sample was 48% (SD=13%) and that for the same haplotype in YRI, JPT and CHB was 48% (SD=19%), 62% (SD=19%) and 60% (SD=18%), respectively. Results for the Spanish and CEU samples were so similar (data not shown) that only the former are reported here. The most common Spanish haplotype was also the most common in only 19% (4/21) of YRI blocks. In the case of the JPT and CHB samples, the most common haplotype coincided with the European counterpart 62% (13/21) and 67% (14/21) of the time, respectively (Table 2b and Supplementary Table 2).

The Yin–Yang haplotype pair was observed in the Spanish sample in 20 of the 21 blocks (95%). We did not observe the Yin–Yang pair in the APAF1 gene. We included these 20 blocks in further analyses of Yin–Yang haplotypes (Table 3a and b). The Yin haplotype was generally the most common haplotype among the Spanish controls (16 (80%) of the 20 blocks) with an average frequency of 45% (SD=15%). The mean Yin haplotype frequency was 13% (SD=12%) in the YRI sample and 39% (SD=28%) and 37% (SD=28%) in the JPT and CHB samples, respectively.

Table 3 Summary information regarding Yin–Yang haplotypes in the different populations sampled: Spanish and YRI (Yoruba), JPT (Japanese) and CHB (Chinese) from HapMap. (a) Yin pair (b) Yang pair

The Yang haplotype was the second most common in Spaniards for 12 (60%) of the 20 blocks, and the third most common for 7 (35%). For one gene (BRCA2), the Yin–Yang pair was made up of the second and fourth most frequent haplotypes. The average frequency of the Yang haplotype in Spaniards was 23% (SD=9.0%), very similar to that in the other sampled populations: 14% (SD=16%) in YRI, and 19% (SD=21%) and 20% (SD=21%) in JPT and CHB, respectively. However, no Yin–Yang pairs were observed in blocks or fragments therein of nine (45%) genes among YRI, five (25%) among JPT and four (20%) among CHB (Table 3a and b).

Where Spanish Yin–Yang haplotypes were present, they accounted for an average of 68% (SD=19%) of all haplotypes in the three non-African samples. These same two haplotypes comprised an average of 57% (SD=31%) and 57% (SD=30%) in JPT and CHB, respectively. They accounted for only 27% of all YRI haplotypes, on average (SD=17%). For only three genes (ALDH2, BRAF and SOS1) was the ancestral haplotype identical to one of the Spanish Yin–Yang haplotype pair; in all three cases, it was the Yang haplotype.

Recent positive selection

We did not observe significant evidence of positive selection (iHS>2.0 across a substantial portion of the gene) for any of the 20 genes with long LD blocks. However, some genes had small areas with elevated iHS scores: XRCC4 (iHS values from 1.9 up to 2.4 in YRI), APC (iHS values from 1.7 to 2.1 in ASN), CDK6 (iHS values from 1.5 up to 2.0 in YRI), MSH3 (iHS values from 1.0 to 1.9 in CEU), RAD54B (iHS values from 1.1 up to 1.7 in YRI) and RB1 (iHS values from 1.1 up to 1.75 in YRI).

The average Fst level of autosomal SNPs is 0.15,26 which is within the range of previously published Fst estimates (0.05–0.15) for neutral (nonselected) genes and SNPs.26, 27 We found evidence of differences between population samples for ALDH2 (Fst=0.6 for CEU vs ASN), PIK3CB (Fst=0.6 for CEU vs ASN), RB1 (Fst=0.5 for both CEU vs YRI and YRI vs ASN), XRCC4 (Fst=0.4 for CEU vs ASN) and APC (Fst=0.4 for YRI vs ASN). Fst and iHS data for all 20 extended LD genes are provided in Supplementary Figure 3.

Association study

Comparison of haplotype frequencies in the 864 breast cancer patients and 845 healthy controls gave some evidence of association with breast cancer (unadjusted P-value <0.05) for haplotypes in seven genes (APAF1, CDK6_1, EGF, ERCC6, MAPK14, MSH3_1 and SOS1) (Table 4). However, none of these associations would be statistically significant after consideration of the multiple tests performed. All except APAF1 (for which Yin–Yang haplotypes were not observed) had the Yin haplotype as the most common (and therefore the reference) haplotype. The putative associated haplotype was the Yang haplotype for only one of these (MAPK14). The ancestral haplotype was neither the most common haplotype, nor that putatively associated with breast cancer risk for any of these seven genes. A full list of haplotypes per gene and their frequencies in Spanish cases and controls is given in Supplementary Table 2.

Table 4 Summary of the most significant findings from the case–control association study with haplotypes in 20 studied genes

Discussion

The aim of our study was to test whether cancer-related genes with long LD block structure are subject to some sort of selection and could contribute to breast cancer aetiology. To address this, we selected and analysed the 20 genes with LD blocks larger than 60 kb among 121 cancer-related genes.

Basic functions such as ‘regulation of nucleobase, nucleoside, nucleotide and nucleic acid metabolism’ related to DNA repair were enriched in these 20 genes compared to the other 101 cancer-related genes (P=2.8 × 10−4). We believe that these functions in the cell had to be carried out by highly ‘regulated’ and ‘controlled’ proteins, and that selection would therefore act against variation in the genes that encode them. Such genes would be presumably under purifying selection and not under the type of positive selection that can be detected by this method. The few haplotype combinations that are present should be fully functional. This would explain the low recombination rates observed in these genes.

The results of our comparison of haplotype distributions across populations are generally in concordance with those obtained by Gabriel et al.9 The highest percentage of population-specific haplotypes was present in African samples (HapMap YRI), which is in agreement with Africans being the most diverse population sampled. Moreover, a great similarity in haplotypes and their frequencies was observed between the Spanish and Asian populations; however, both populations had a greater proportion of unique population-specific haplotypes than observed by Gabriel et al.9

We found that an average of three haplotypes per gene represented over 90% of the total haplotype distribution. That is, the majority of these haplotypes have high frequencies. The most common haplotype in each block had a frequency of over 50% in all the sampled populations. Our results are consistent with those of other groups in terms of finding that genes with long LD blocks have reduced diversity of haplotypes.28

Since the sequence of the chimpanzee genome was published in 2005, genetic comparisons between chimp and human have become widely possible.3 We were able to determine the ancestral allele for all the SNPs in the 20 genes included in this study, which is not surprising considering that human and chimp genomes are 99% identical.4 The ancestral haplotype was inferred for 21 long (>60 kb) LD blocks in the 20 genes. However, it was not observed in about a quarter of the LD blocks. It was most often observed, and with higher frequency, in YRI, next most often in Asians (JPT and CHB) and least often in Spaniards. This finding is consistent with Africans being the most genetically diverse population. It also corroborates the out-of-Africa hypothesis of human populations29 as well as human demographic history in which the ancestral African population has maintained a larger effective population size and has had more time for recombination and mutations to reduce LD. On the contrary, the HapMap CEU and Spanish samples had the highest number of derived haplotypes.

We found that a Yin–Yang haplotype pair was present in more genes in the non-African samples than in the YRI sample, being highest in the European-Spanish population. The Yin–Yang pairs constituted a substantial fraction of the total haplotype diversity. The average combined frequency of the Yin–Yang pair in the four populations sampled constituted an overall 62% (SD=25.08%) of the haplotype diversity seen in these blocks. This percentage is almost double that reported by Zhang et al.13 That is, it appears that in general, Yin–Yang haplotypes are more prevalent, and haplotypic diversity is lower, in genes with large LD blocks compared with nonselected genomic regions, although this is less the case in older (African) populations. One possible explanation for this is that regions of high LD were naturally enriched with this phenomenon and follow a neutral evolutionary model, suggesting that Yin–Yang haplotypes are genetic signatures that emerged prior to the African diaspora.13 Another possible explanation is that they represent a selection bias: when candidate SNPs are selected for high pairwise R2 (>0.8), the Yin–Yang pairs naturally appear more often in the specific population in which the SNPs were selected.

To evaluate the sensitivity of HapMap data for detecting elevated iHS and Fst, we obtained these values for the lactase (LCT) and SCA2 (ATXN2) genes, both in regions with high LD and both known to be positively selected, but not related to cancer.8, 30 Using Haplotter,23 we observed significant values of iHS over a large portion of the lactase gene (iHS>3 for about 1.5 Mb) and elevated Fst for SCA2 (ATXN2) (Fst>0.6 for about 1.5 Mb).

When the genes of this study were screened using the same program, we detected differences in SNP frequency across populations (Fst>0.4) for four genes (APC, CDK6, RB1 and XRCC4). Such differences were also detected for ALDH2, but this gene is located in the same genomic region as SCA2 (ATXN2).8 However, for each gene, the elevated Fst was observed as a single peak that did not extend across the genomic region. Such long regions of high LD could have been subjected to evolutionary forces such as selection in humans. However, in the present study, we found no clear evidence of positive selection having acted on our ‘high-LD’ candidate cancer genes using the method of Sabeti et al.24

Voight et al23 used the same method to identify positive selection acting on genes involved in chemosensory perception, olfaction and fertilisation. However, these functions are very different to those of our 20 genes, which are involved in more basic processes such as cellular signal transduction, DNA repair and the cell cycle. It may be that these latter functions are too basic, or that these genes tend to act later in life (ie, after reproduction), so that positive selection does not act upon them.

We hypothesised that the genes with long LD blocks, and Yin–Yang and ancestral haplotypes contained therein in particular, might be more likely to be involved in breast cancer predisposition. An overrepresentation of mutations in BRCA1 has been observed in the Yang haplotype relative to the most common Yin haplotype.31 In addition, a protective effect against breast cancer risk has been shown for the minor allele at an SNP that occurs on the Yang haplotype in ERCC4.15 Furthermore, a putative role of the ancestral allele in six cancer-susceptibility SNPs has been suggested based on a review of selected association studies.4 For one of the SNPs, R72P in TP53, the most common human allele, Arg72, is the derived allele, whose frequency ranges from 55 to 92% among different human populations, and the ancestral allele is Pro72; both alleles have been associated with cancer risk in different studies.32, 33, 34, 35, 36, 37 In our study, we found no evidence of association with breast cancer risk for haplotypes in any of the 20 genes studied after adjustment for multiple testing. For just one of the seven blocks with haplotypes that had unadjusted P-values less than 0.05 (MAPK14), the reference (most common) and putative risk-associated haplotypes constituted the Yin and Yang, respectively, of the Yin–Yang pair. The ancestral haplotype was neither the reference nor the best candidate associated haplotype for any of these blocks. In summary, we found no evidence that either Yin–Yang haplotypes or ancestral haplotypes are more likely to be associated with breast cancer risk. This may be due to our study lacking power to detect association.

In conclusion, we detected a reduced haplotype diversity in genes with elevated LD over a long distance (>60 kb), with an average of three haplotypes per gene accounting for >90% of the diversity, two of those being the pair Yin–Yang in most of the cases. Moreover, the most common haplotype (most of the time the Yin) had an average frequency of around 50%. In addition, we observed the ancestral haplotype in the Spanish, JPT, CHB and YRI populations for 65% of the genes at a mean frequency of 20% in the Spanish and about 40% in the other HapMap populations. Finally, we found no evidence that positive selection has acted on these 20 genes nor that haplotypes formed by SNPs in them are associated with breast cancer.