Introduction

The Haplotype Mapping Project (HapMap) has created a significant resource for human genetic studies by mapping the patterns of linkage disequilibrium (LD) in the human genome and by providing a basis for LD-based marker selection for genome-wide association studies.1 In the HapMap project, samples from representative European, African and Asian populations were genotyped, and the genotype data were published online for free use by the scientific community. The latest HapMap release (no. 22 of March 2007) includes more than 3.8 million validated single nucleotide polymorphisms (SNPs). An important question for the application of the results from the HapMap project has been how representative the populations analyzed are of other populations of interest worldwide. This question has been addressed in a number of studies. LD patterns and tagSNP transferability have been compared in samples from diverse geographical regions of the world,2, 3, 4 or investigated more closely in one or more specific populations.5, 6 Gu et al7 recently reported an investigation of 517 kb of the human genome distributed over 10 loci and found that popular definitions of haplotype blocks were highly variable between samples from different parts of the world, whereas tagSNPs were more transferable between populations. De Bakker et al3 investigated 25 regions of approximately 100 kb in length. They found that for samples from individuals self-described as ‘white’ from Hawaii and individuals from the Botnia region in Finland, loss of coverage in tagging when using HapMap-defined tagSNP sets was no larger than sampling variation within the HapMap samples. A less clear result was obtained when Mueller et al5 analyzed four regions spanning 749 kb in total in populations across Europe from Italy to Estonia. Two of the regions were well represented by the HapMap CEU trios, whereas the data from the two other regions were more difficult to interpret.

In our study we compared how representative the HapMap data were for six samples from five European populations. We found that tagSNPs defined in the HapMap CEU sample represented our European samples well, with the proportion of tagged SNPs ranging from 87% in the Australian sample to 95% in the Kuusamo sample. Allele frequencies were fairly similar in the samples, with the exception of the Finnish sample from the Kuusamo region, where we observed overall differences in allele frequencies compared to the other populations.

Materials and methods

Study populations/DNA samples

Nuclear family samples with one child and two parents from Australia, Finland (two samples), the Netherlands, the United Kingdom and Sweden were genotyped. The Australian trios were selected from an adolescent twin family study.8 The Dutch sample consisted of families who are registered with the Netherlands Twin Register.9 The UK adult twin registry (http://www.TwinsUK.ac.uk) consists of volunteer Caucasian twins born in the United Kingdom and drawn from all over the country. This registry has been shown to be broadly representative of the UK population.10 The Finnish samples comprised a sample from the general Finnish population, derived largely from early settlement Finland, and a sample from Kuusamo, a small subisolate within late settlement Finland.11 The Swedish samples were from the Uppsala Family Study12 and originate from the central region of Sweden. The samples were selected at random from the larger population cohorts, with the practical limitation that samples with a limited amount of extracted DNA available for study were avoided. The DNA samples had been extracted using standard methods; the Australian samples were extracted using the salt extraction method.13 Eleven of the Dutch DNA samples were amplified by multiple displacement amplification using the GenomiPhi DNA amplification kit (GE Healthcare, Uppsala, Sweden) prior to genotyping. The genotyped samples are summarized in Table 1.

Table 1 The genotyped samples in the study

Genomic region and SNP selection

The genomic region considered in this study was a 1.47 Mbp region on chromosome 4q22, containing the glutamate receptor, ionotropic, delta 2 gene (GRID2). Out of the available SNPs genotyped in the HapMap project, 197 SNPs with an average spacing of 8 kb were genotyped. One of these SNPs was located in the coding regions of GRID2, but was synonymous, and the remaining SNPs were located outside exon-coding regions of GRID2. The allele frequency distributions for the populations are graphed in Figure 1.

Figure 1

Minor allele frequency distributions for the genotyped population samples.

Genotyping

Primers for PCR and minisequencing (single-base primer extension) were designed using the Autoprimer software (http://www.autoprimer.com, Beckman Coulter Inc., Fullerton, CA, USA). The SNPs were genotyped with the GenomeLab SNPstream system14 using reagents provided by the manufacturer (Beckman Coulter). In brief, the genotyping protocol involved 12-plex PCR amplification and genotyping of SNPs with the same nucleotide variation. PCR products were cleaned up using exonuclease I and shrimp alkaline phosphatase, and genotyped using 12-plex minisequencing with fluorescent terminating nucleotides, followed by capture of the extension products by hybridization to complementary tag-oligonucleotides immobilized in the wells of 384-well microtiter plates. Incorporated fluorescence was detected using the CCD camera of the SNPstream instrument and the genotypes were called by cluster analysis using the GetGenos software (Beckman Coulter). Of the 197 SNPs selected, 11 failed in genotyping for technical reasons and 13 were monomorphic in the whole sample set. Thus, 173 SNPs were included in the subsequent analyses (Supplementary Table 1). Checks for Hardy–Weinberg equilibrium and Mendelian inheritance were applied as quality controls using an in-house software and database system (SNP Quality Analysis Tool). The average sample success rate was 96.4% and the reproducibility was 99.98% according to duplicate determination of 21% of the genotypes in two independent experiments.
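The two quality-control checks named above can be illustrated with a minimal sketch. This is not the SNP Quality Analysis Tool, whose internals are not described here. The function names, the two-character genotype encoding and the use of a Pearson chi-square statistic for the Hardy–Weinberg check are assumptions made for illustration only.

```python
from itertools import product


def mendelian_consistent(father, mother, child):
    """Return True if the child's genotype can be formed from one paternal
    and one maternal allele. Genotypes are two-character strings such as
    "AG" (hypothetical encoding); allele order within a genotype is ignored."""
    child_alleles = sorted(child)
    return any(sorted((f, m)) == child_alleles
               for f, m in product(father, mother))


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Pearson chi-square statistic (1 degree of freedom) comparing observed
    genotype counts with Hardy-Weinberg expectations."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele a
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
```

In a trio design, a SNP or family with repeated Mendelian inconsistencies would typically be flagged for review, and the Hardy–Weinberg statistic would be computed on the unrelated parents only.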

Data analysis

The Haploview15 v3.32 software was used for D′ and r2 calculations and for analysis of the LD patterns in each of the genotyped populations. The Tagger16 implementation in Haploview was used for tagSNP selection using pair-wise tagging and an r2 cutoff value of 0.8. TagSNPs were defined in the HapMap CEU data set and their performance was assessed in the six study population samples. To test for differences in allele frequencies, Fisher's exact test was used to compare the genotype data from the parents in the trio families using the R software package version 2.2.1 (www.r-project.org). The population assignment of each individual was permuted in a permutation test with 7000 permutations to determine a global 5% error level for the allele frequency comparisons of the HapMap sample against all other samples.
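The allele frequency comparison and the permutation-derived global error level can be sketched as follows. The analysis itself was run in R. This standalone Python version assumes that each SNP is summarized as a 2 × 2 table of allele counts and that the global threshold is taken as the lower 5% quantile of the smallest P-value seen in each permutation. Both are assumptions, as are all function names and the 0/1/2 genotype coding.

```python
import random
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / total

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are no more probable than the observed one.
    tail = sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= observed * (1 + 1e-9))
    return min(1.0, tail)


def allele_counts(genotypes):
    """genotypes: minor-allele counts (0, 1 or 2) per individual."""
    minor = sum(genotypes)
    return minor, 2 * len(genotypes) - minor


def min_p(labels, genotypes, reference):
    """Smallest P-value over all SNPs and all reference-vs-other comparisons."""
    groups = {}
    for label, individual in zip(labels, genotypes):
        groups.setdefault(label, []).append(individual)
    ref = groups[reference]
    best = 1.0
    for label, members in groups.items():
        if label == reference:
            continue
        for snp in range(len(genotypes[0])):
            a, b = allele_counts([g[snp] for g in ref])
            c, d = allele_counts([g[snp] for g in members])
            best = min(best, fisher_exact_two_sided(a, b, c, d))
    return best


def permutation_threshold(labels, genotypes, reference,
                          n_permutations=1000, alpha=0.05, seed=0):
    """Per-test P-value threshold giving a global error level of alpha."""
    rng = random.Random(seed)
    shuffled = list(labels)
    minima = []
    for _ in range(n_permutations):
        rng.shuffle(shuffled)          # reassign individuals to populations
        minima.append(min_p(shuffled, genotypes, reference))
    minima.sort()
    return minima[max(0, int(alpha * n_permutations) - 1)]
```

Because the labels are shuffled while the genotype vectors stay intact, the correlation between SNPs in LD is preserved in every permutation, which is the main reason to prefer this approach over a Bonferroni correction across SNPs.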

Principal component analysis (PCA) was used to project SNP allele frequencies in the population samples into two dimensions to summarize similarities and differences. In the PCA, populations were used as observations and the SNP allele frequencies were used as variables. The data were centered but not scaled.
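As an illustration of this setup (populations as observations, SNP allele frequencies as variables, centering without scaling), the following is a minimal sketch in pure Python. The software used for the PCA is not stated here, so the power-iteration approach and all function names are assumptions. In practice a numerical library would normally be used.

```python
import math


def center_columns(rows):
    """Subtract each column (SNP) mean; columns are not scaled to unit variance."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def top_eigenpair(matrix, iterations=1000):
    """Largest eigenvalue and eigenvector of a symmetric positive
    semi-definite matrix, found by power iteration."""
    n = len(matrix)
    vec = [1.0 + 0.1 * k for k in range(n)]   # arbitrary non-constant start
    value = 0.0
    for _ in range(iterations):
        prod = [sum(matrix[r][c] * vec[c] for c in range(n)) for r in range(n)]
        value = math.sqrt(sum(x * x for x in prod))
        if value == 0.0:
            break
        vec = [x / value for x in prod]
    return value, vec


def pca_scores(rows, n_components=2):
    """Principal component scores for each observation (population).

    rows: one list of SNP allele frequencies per population."""
    centered = center_columns(rows)
    # Gram matrix between populations; its eigenvectors give the scores.
    gram = [[sum(a * b for a, b in zip(r1, r2)) for r2 in centered]
            for r1 in centered]
    size = len(gram)
    scores = [[] for _ in rows]
    for _ in range(n_components):
        value, vec = top_eigenpair(gram)
        for row_scores, u in zip(scores, vec):
            row_scores.append(math.sqrt(value) * u)
        # Deflate: remove the component just found before extracting the next.
        gram = [[gram[r][c] - value * vec[r] * vec[c] for c in range(size)]
                for r in range(size)]
    return scores
```

Not scaling the variables means that SNPs with large frequency differences between populations dominate the components, whereas scaling would give every SNP equal weight regardless of how much its frequency varies. The sign of each component is arbitrary.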

The Gabriel et al blocking algorithm17 was used to annotate areas of high D′ across the 1.47 Mbp GRID2 genomic region. Four areas with different patterns of r2 in the blocks defined by this algorithm were selected for haplotype reconstruction and comparison of haplotype distributions in the GRID2 region. The PHASE version 2.2.1 software was used for haplotype reconstruction and testing for differences in the haplotype distributions between the HapMap sample and the other population samples.18, 19 A thousand permutations were used in the PHASE software to calculate the P-values.

Results

To compare the LD pattern, haplotype structure and tagSNP efficiency in a 1.47 Mbp genomic region containing the GRID2 gene between five European populations and the HapMap CEU data, we analyzed 173 SNPs in this region. Figure 2 is a schematic view of the GRID2 genomic region, the positions of the SNPs and the four selected regions with high LD used for haplotype reconstruction. The investigated genomic region contains both regions with a strong block-like LD structure and more fragmented areas (Supplementary Figures 1 and 2).

Figure 2

A schematic illustration of the chromosomal region containing the GRID2 gene on chromosome 4 (top), the single nucleotide polymorphism (SNP) positions in this region (middle) and the example regions 1–4 used for haplotype reconstruction (bottom).

Allele frequencies

The minor allele frequency (MAF) distributions in each of the genotyped samples are graphed in Figure 1. When calculated across all six populations, the first quartile of the MAFs was 0.23, the median and mean were 0.34 and 0.31, respectively, and the third quartile of the MAFs was 0.41. The minimum and maximum values for the MAFs were 0.018 and 0.49, respectively. To determine whether the allele frequencies of the SNPs differed between the population samples, the allele frequencies of the parents in the trio families were subjected to pair-wise comparisons between populations and the HapMap CEU data.

Permutation testing indicated that the global 5% error level for a single comparison was P<1.30 × 10⁻⁴, when the allele frequencies in the HapMap sample were tested against the allele frequencies in the other six population samples. This threshold was only reached for the SNPs rs1912718 and rs6854968 in the comparisons between the HapMap and the Finnish Kuusamo samples. The allele frequencies for the SNP rs1912718 were 0.48 in the HapMap sample and 0.76 in the Kuusamo sample. The corresponding numbers for the SNP rs6854968 were 0.34 and 0.62, respectively. An all-against-all comparison was also performed between the populations, including the HapMap sample, to search for overall differences in allele frequencies between populations. The numbers of differences between populations at a level of P<0.05 are tabulated in Table 2. All samples except that from Kuusamo, Finland had a similar number of allele frequency differences. The Kuusamo sample had a larger number of allele frequency differences when compared to the other samples, and for every population, the comparison with Kuusamo was the one that yielded the highest number of differences. Given the number of SNPs tested, the expected number of random results below 5% would be between 8 and 9 for each combination of samples. The sample from Kuusamo was the only one where these numbers differed clearly from those expected. For all comparisons with Kuusamo, the number of differences was more than threefold larger than expected by chance.
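The expected number of chance findings quoted above follows directly from the number of SNPs and the nominal significance level. A short worked example, with a hypothetical function name:

```python
def expected_chance_hits(n_tests, alpha):
    """Expected number of tests with P < alpha when no true difference exists."""
    return n_tests * alpha


n_snps = 173          # SNPs included in the analysis
alpha = 0.05          # nominal per-SNP significance level

expected = expected_chance_hits(n_snps, alpha)   # about 8.65, i.e. between 8 and 9
threefold = 3 * expected                         # about 26 differences
```

LD between neighbouring SNPs makes the tests correlated, which changes the spread of the count around this value but not the expectation itself.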

Table 2 The number of pair-wise allele frequency differences between populations at P<0.05

PCA was used to project the allele frequencies of the 173 SNPs into two dimensions. This can be visualized as positioning a plane (the first and second component) in 173-dimensional space to describe as much of the variance in the data as possible. When projected onto the plane, the coordinates of the data points are converted to the new two-dimensional coordinates. Again, the Kuusamo sample separated to the largest extent from the rest of the populations along the first principal component, that is, the x axis in Figure 3, while the other populations clustered together. The second principal component, that is, the y axis in Figure 3, separates the Finnish sample from the other populations, indicating that in part a different set of SNPs was responsible for this separation than for that of the Kuusamo sample. To determine which of the SNPs were responsible for the separation of the samples, the loadings plot (Supplementary Figure 3) was used. This plot shows which of the original variables (SNP allele frequencies) influenced the principal components. For example, the SNPs plotted at the extremes of the x axis in the loadings plot had the strongest influence on the x axis (first component) in the score plot of the populations. In this case the SNPs 104–110 (Supplementary Table 1) together with SNP 46 had the strongest effect on the position of the samples along the first component. The SNPs 104–110 are located at the end of a region of high LD adjacent to a recombination hot spot described by the HapMap project, and SNP 46 is located in a region of low LD between two regions with high D′ values.

Figure 3

Principal component score plot of the first two principal components of the allele frequencies in the populations. The plot is generated by positioning a plane in the cloud of data points representing the single nucleotide polymorphism (SNP) allele frequencies so that the maximum variation is captured when the data points are projected onto the plane. In effect, the plot gives a reduction in dimensions from 173 representing the number of allele frequencies from different SNPs to two, while maintaining as much information as possible.

Linkage disequilibrium and tagSNP efficiency

The average levels of LD were similar between the population samples, with the highest average D′ value for the Kuusamo sample and the highest average r2 value in the UK sample (Table 3). The patterns of D′ and r2 across the region were similar in all the samples, with smaller differences in the general level of LD in some regions (Supplementary Figures 1 and 2, HapMap D′ and r2 plots). To evaluate the suitability of using a set of tagSNPs defined based on the HapMap data in our study populations, a set of tagSNPs was selected using a pair-wise r2 cutoff value of 0.8, which resulted in a set of 63 SNPs. This set of tagSNPs was then applied to the other populations, and its ability to capture the genetic variation was determined. About 90% of the SNPs in the five European populations were represented at r2>0.8 using the HapMap tagSNP set. In the Kuusamo sample we observed the best capture of variation, with 95% of markers represented. The lowest efficiency of the tagSNPs was observed in the Australian sample, where 87% of the SNPs were captured (Table 4). A similar pattern was seen for the number of native tagSNPs defined specifically in each of the populations to capture the variation of all other SNPs. The Kuusamo sample required only 58 SNPs to tag all 173 SNPs, whereas in the Australian sample, 73 SNPs were required.
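The tag transfer evaluation described above (select tagSNPs in a reference sample, then measure the fraction of SNPs captured in another sample) can be sketched as below. This is a simplified greedy pair-wise procedure and not Haploview's Tagger implementation. It assumes phased haplotypes coded as tuples of 0/1 alleles and counts a SNP as captured when r2 is at least 0.8. All function names are hypothetical.

```python
def r_squared(haplotypes, i, j):
    """Pair-wise r2 between SNPs i and j from phased 0/1 haplotypes."""
    n = len(haplotypes)
    p_i = sum(h[i] for h in haplotypes) / n
    p_j = sum(h[j] for h in haplotypes) / n
    p_ij = sum(h[i] * h[j] for h in haplotypes) / n
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:
        return 0.0                       # at least one SNP is monomorphic
    d = p_ij - p_i * p_j                 # LD coefficient D
    return d * d / denom


def greedy_pairwise_tags(haplotypes, threshold=0.8):
    """Repeatedly pick the SNP that captures the most still-untagged SNPs."""
    n_snps = len(haplotypes[0])
    captures = {
        i: {j for j in range(n_snps)
            if i == j or r_squared(haplotypes, i, j) >= threshold}
        for i in range(n_snps)
    }
    untagged = set(range(n_snps))
    tags = []
    while untagged:
        best = max(range(n_snps), key=lambda i: len(captures[i] & untagged))
        tags.append(best)
        untagged -= captures[best]
    return tags


def coverage(haplotypes, tags, threshold=0.8):
    """Fraction of all SNPs that are tags or reach the r2 threshold with a tag."""
    n_snps = len(haplotypes[0])
    captured = sum(
        1 for j in range(n_snps)
        if j in tags
        or any(r_squared(haplotypes, t, j) >= threshold for t in tags)
    )
    return captured / n_snps


# Toy data: SNPs 0 and 1 are in complete LD in the reference sample only.
reference = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
target = [(0, 0, 0), (0, 1, 1), (1, 1, 0), (1, 1, 1)]
tags = greedy_pairwise_tags(reference)
```

With these toy haplotypes, the tags selected in the reference cover every reference SNP, but SNP 1 is no longer captured in the target sample, which is the kind of loss of coverage the percentages above quantify.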

Table 3 Summary statistics for pair-wise r2- and D′-values between SNPs in the full sets of SNPs and between SNPs within 50 kb of each other
Table 4 Performance of HapMap tagSNPs in the HapMap samples and in the study population samples

The distribution of r2 between SNPs and the HapMap tagSNPs in Figure 4 illustrates the slight differences between populations. In the Australian sample with a slightly lower tagging efficiency, this difference was due to a wider distribution of values bringing a larger number of markers below the r2=0.8 threshold (22 SNPs for the Australian sample versus 9 for the Finnish Kuusamo sample). No increase in SNPs with very low LD to the HapMap tagSNPs was observed in the less effectively tagged samples.

Figure 4

Distributions of pair-wise r2-values between best HapMap tagSNPs and all single nucleotide polymorphisms (SNPs) in the genotyped region.

Haplotypes

To illustrate the range of variation in haplotype frequencies between different population samples, four regions defined by the Gabriel et al algorithm17 within the 1.47 Mbp region were selected for haplotype reconstruction. They were all regions of high D′, chosen so that the haplotype reconstruction software would perform optimally, increasing the confidence of detecting real differences in haplotype frequencies rather than observing shortcomings of the software used.19 In Figure 5, the haplotypes with frequencies > 1% in the study populations are shown in a separate panel for each region. In most cases differences in haplotype frequencies were smaller than 10% between the HapMap and other populations, but in some cases they differed by as much as 20% (Figure 5).

Figure 5

Haplotype frequencies > 0.01 in the example regions, ordered by their frequency in the HapMap sample. Haplotype distributions differing from the HapMap data at P<0.05 are marked with an asterisk. Region 1, single nucleotide polymorphisms (SNPs) rs9993173–rs7683035; region 2, SNPs rs1456359–rs6811974; region 3, SNPs rs1433654–rs1433648; region 4, SNPs rs4692986–rs9993013.

Differences in the haplotype distribution between the HapMap sample and the other six population samples were tested pair wise using the case–control functionality in PHASE.18, 19 This software compares the haplotype distributions in two population samples, and also takes into account the degree of similarity between the observed haplotypes.

The null hypothesis is that the two haplotype distributions in the tested populations are random selections from the same set of haplotype frequencies. This is compared with the alternative hypothesis that the haplotypes in the second population are more similar to one another than to those in the first population, which in our case is the HapMap data. Three of these comparisons were significant at the P<0.05 level. These were the comparisons of the HapMap sample with the Australian sample in region 4 (P=0.039), the Kuusamo sample in region 3 (P=0.002) and the Finnish sample from the general population in region 1 (P=0.032). The haplotype distributions in the population samples that differed at P<0.05 from the HapMap sample are marked with an asterisk in Figure 5. Only the result in the Kuusamo sample would be significant after a Bonferroni correction for the multiple (6 × 4) tests, which gives a critical P-value of 0.0021. This result appears to have been mainly driven by a haplotype present at a frequency of 0.11 in the Kuusamo sample but absent in the HapMap sample. This haplotype also contained several sequence differences compared to the haplotypes present in the HapMap sample.
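The Bonferroni threshold quoted above is the nominal level divided by the number of tests. A short worked example using the three nominally significant P-values reported in the text (the function name and the dictionary keys are illustrative):

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test critical P-value that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


# Six study samples compared with HapMap in four regions gives 24 tests.
threshold = bonferroni_threshold(0.05, 6 * 4)

# Nominally significant comparisons, keyed by (sample, region).
p_values = {
    ("Australia", 4): 0.039,
    ("Kuusamo", 3): 0.002,
    ("Finland, general population", 1): 0.032,
}
significant = [key for key, p in p_values.items() if p < threshold]
```

The threshold evaluates to 0.05/24, which rounds to 0.0021, so only the Kuusamo comparison in region 3 remains significant.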

Discussion

Genome-wide association studies with SNP markers have recently become a standard tool in the search for genes underlying complex diseases.20 The data from the HapMap project have served as the key resource for selecting the SNPs for the panels used in genome-wide genotyping studies.21, 22 Information on LD is also required to use a tagSNP strategy for custom design of SNP panels that have maximal power to capture the genetic variation in a specific genomic region of interest. The aim of our study was to investigate how well the HapMap European data represent five specific European populations, from which SNP genotype data will be subjected to combined/pooled association analysis. Our results from a careful investigation of SNPs in a 1.47 Mbp genomic region using a set of SNPs located at an average spacing of 8 kb are reassuring.

We observed no indications of differences in allele frequencies between the studied populations and the HapMap data, with the single exception of the Kuusamo sample. This is somewhat in contrast to the result of Willer et al,23 who found evidence for population differences in SNP allele frequencies between the HapMap sample and their Finnish sample. They investigated a 17.9 Mb region on chromosome 14 in a more general Finnish population. However, their sample size was larger than ours, and thus smaller differences would be detectable than in our study. The Kuusamo isolate has an exceptionally well-documented history, and was founded by just 78 settlers 310 years ago. Extended LD has been demonstrated in the Kuusamo isolate, using microsatellite markers.24 Our study was designed to include a number of samples similar to that of the HapMap project. Permutation testing indicated that only two of the comparisons were significantly different at a global P<0.05, as a result of the many tests involved in testing the HapMap samples against the six other population samples for each SNP, as well as the limited sample sizes. When examining the marker set as a whole, the Kuusamo sample had more differences in SNP allele frequencies than expected by chance. Our conclusion is that allele frequency estimates based on the HapMap data are as good as can be expected from the sample sizes in all populations, with the exception of the Kuusamo sample where extra caution would be advisable based on the observed differences.

The performance of the tagSNPs defined in the HapMap data was good in the studied European populations. Between 87% (Australia) and 95% (Kuusamo, Finland) of the SNPs were captured at r2>0.8. This is slightly higher than the 89% reported by de Bakker et al3 as the sampling variation of the CEPH parent-offspring trios from Utah (HapMap CEU) and the 78 and 83% for the self-described white Hawaiian and the Finnish Botnia samples, respectively. One limitation of our study is that we did not sample SNPs other than those genotyped in the HapMap project, so that rarer SNP variants in particular could be less well represented. Despite this limitation we conclude that the HapMap CEU sample represents our samples of interest well and is of value for tagSNP selection. Although we observed differences in allele frequencies between the Finnish Kuusamo sample and the other populations, the Kuusamo population was the one best captured by the HapMap tagSNP set. This surprising finding seems to result from the fact that the Kuusamo sample is the easiest to tag, as only 58 native Kuusamo tagSNPs were needed. For comparison, 63 native HapMap tagSNPs were required to tag the HapMap set, and this set was applied to the other populations. These results are similar to those recently reported by Service et al.25 They investigated the performance of HapMap tagSNPs in several population isolates, including the Kuusamo isolate, which was also included in the study presented here. Despite factors such as genetic drift that could have a strong effect in population isolates, they found that HapMap tagSNPs worked well in several population isolates, including the one from Kuusamo, Finland, for 2215 markers on chromosome 22. In their panel 84% of non-tags were captured at r2>0.8.25 Our higher representation of Kuusamo SNPs by HapMap tagSNPs could be an effect of sampling as well as of a larger proportion of more common SNPs in the panel.

Overall, our work agrees with previous studies such as the one by Willer et al,23 which concluded that HapMap tagSNPs represented their Finnish sample well for tagSNP selection despite indications of allele frequency differences, and the study by Stankovich et al,26 which analyzed regions on chromosomes 6q and 10q in Australian multiple sclerosis cases and found that their Australian samples were similarly well represented.