Introduction

Genetic association studies make use of patterns of linkage disequilibrium (LD) between genetic polymorphisms for efficient genotyping (Carlson et al. 2004; Service et al. 2006). LD patterns reflect the ancestry of a population and vary considerably throughout the genome (Reich et al. 2001; Sawyer et al. 2005; Plagnol and Wall 2006). The International HapMap Consortium is studying and cataloguing DNA sequence variation and characterizing these patterns of LD across the genome (The International HapMap Consortium 2005), allowing correlated single nucleotide polymorphisms (SNPs) to be excluded from genotyping, but considered in analysis. The HapMap is made up of four panels of dense haplotype maps for individuals characterized as Han Chinese from Beijing, China (CHB), Japanese from Tokyo, Japan (JPT), Caucasians from Utah, USA, with northern and western European ancestry (CEU), and Yoruban of Ibadan, Nigeria (YRI).

The HapMap is an important resource for choosing tag SNPs in disease association and population studies (The International HapMap Consortium 2005). However, worldwide population variation is not completely characterized, and an essential question is whether tag SNPs chosen using HapMap panels will adequately capture patterns of genetic variation in other populations (Weale et al. 2003; Nejentsev et al. 2004; Evans and Cardon 2005; Ke et al. 2005; Mueller et al. 2005; Ramirez-Soriano et al. 2005; Gonzalez-Neira et al. 2006; Huang et al. 2006; Montpetit et al. 2006; Ribas et al. 2006; Willer et al. 2006). Furthermore, for populations similar to those genotyped in the HapMap project, HapMap data may be used to directly predict genotypes of non-tag SNPs for analysis in association studies (Eyheramendy et al. 2007; Paschou et al. 2007). Previous studies observed that the HapMap CHB and JPT panels have very similar patterns of LD and could act as a proxy for other geographically related populations (Beaty et al. 2005; Lim et al. 2006; de Bakker et al. 2006; Conrad et al. 2006; Mahasirimongkol et al. 2006; Yoo et al. 2006). However, data are currently unavailable to assess the effectiveness of using the existing HapMap data to guide SNP selection and interpretation for samples from the Cebu Longitudinal Health and Nutrition Survey (CLHNS) cohort from metro Cebu in the central Philippines (Cebu Study Team 1991; Adair 2004). This study assesses the advantages of having HapMap data from two related Asian panels to evaluate whether the combined CHB and JPT panels would more effectively capture genetic variability in Cebu Filipinos than the CHB or JPT panels alone. In addition, this study develops the most efficient criteria for selecting tag SNPs from the HapMap panels for future genetic association studies in Cebu Filipino samples.

To address these issues, SNPs from within the ten HapMap ENCyclopedia Of DNA Elements (ENCODE) reference regions were used (The ENCODE Project Consortium 2004). These regions were re-sequenced in 48 unrelated individuals (8 CHB, 8 JPT, 16 CEU, and 16 YRI) for SNP discovery and reflect the density of SNPs in the genome more accurately than other regions in HapMap. The SNP density in these ENCODE regions is higher than in the remainder of HapMap (The International HapMap Consortium 2005). The similarity of the HapMap samples to 80 Cebu Filipino samples was assessed, using allele frequency estimates, pairwise LD (r²), and haplotype frequency estimates as measures of similarity. Furthermore, the efficiency of using tag SNPs selected from the HapMap Asian panels for capturing genetic variation in Cebu Filipino samples was studied.

Materials and methods

Samples

Eighty unrelated Cebu Filipino individuals were randomly selected from a cohort of healthy women from the CLHNS (www.cpc.unc.edu/projects/cebu). Informed consent was obtained from all individuals and the study protocol was approved by the University of North Carolina Institutional Review Board for the Protection of Human Subjects.

Genomic DNA was isolated from peripheral blood lymphocytes using automated and manual DNA extraction methods (Puregene, Gentra) by the University of North Carolina, Chapel Hill BioSpecimen Processing Facility. Centre d’Etude du Polymorphisme Humain (CEPH) DNA samples were obtained from Coriell (Camden, NJ).

HapMap genotype data were obtained from the HapMap database (www.hapmap.org) for all available unrelated individuals, including 45 CHB, 44 JPT, 60 CEU parents of trios, and 60 YRI parents of trios. For some analyses, the CHB and JPT samples were combined (indicated as CHB + JPT) (The International HapMap Consortium 2005).

SNP selection and genotyping

To represent the overall complexity of the genome, the central 40-kb region from within each of the ten 500-kb ENCODE regions that have been used for SNP discovery and dense SNP genotyping was chosen for this study (Table 1). SNPs were selected if they were polymorphic (minor allele frequency, MAF, >0) in the HapMap CHB, JPT, or CEU panels.

Table 1 Number of SNPs successfully genotyped by population and region

Of the 883 SNPs that met these criteria, 215 were eliminated based on Illumina design score (calculated December 2005). One SNP identified by re-sequencing region ENr213 in Cebu Filipino samples (see below) was included, resulting in a total of 669 SNPs that were genotyped in the Cebu Filipino samples. SNP genotyping was performed at the Mammalian Genotyping Core at the University of North Carolina, Chapel Hill, using the Illumina GoldenGate (Illumina Inc., San Diego, CA) genotyping assay (Gunderson et al. 2004). Of the 669 SNPs attempted, 36 SNPs were excluded based on poorly defined clusters (n = 28), genotyping completeness <90% (n = 3), or inconsistency with Hardy-Weinberg equilibrium (p < 0.001; n = 5). Six additional SNPs were excluded because of two or more genotype discrepancies between six CEPH DNA samples and equivalent HapMap CEU genotypes. SNPs were also evaluated for two or more genotyping discrepancies between seven duplicate samples; however, no SNPs needed to be dropped based on this criterion. The genotyping success rate of the final 627 SNPs was 99.9%, and the discrepancy rate was 0.02%. Of these 627 SNPs, 501 (80%) were polymorphic (MAF >0) in Cebu Filipino samples. The average marker spacing of these 501 polymorphic SNPs was 1 SNP/798 bp.
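The Hardy-Weinberg exclusion criterion above (p < 0.001) can be illustrated with a minimal sketch of a Pearson χ² test with one degree of freedom. The function name and the example genotype counts are hypothetical and are not taken from the study data; the software actually used for this step is not specified here.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Pearson chi-square test (1 df) of Hardy-Weinberg proportions.

    Arguments are the observed genotype counts for a biallelic SNP.
    Returns (chi-square statistic, p-value).
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("at least one genotype is required")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    # Classes with zero expected count (monomorphic SNPs) are skipped.
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts for 80 individuals: no heterozygotes at all.
chi2, p_value = hwe_chi_square(40, 0, 40)
flag_for_exclusion = p_value < 0.001
```

A SNP flagged in this way would correspond to the "inconsistency with Hardy-Weinberg equilibrium" category; with small expected counts an exact test may be preferable to the χ² approximation.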

Statistical analysis

Tests for consistency of genotype distributions with expected Hardy-Weinberg equilibrium proportions were calculated using standard Pearson’s χ² statistics. Only markers with a MAF ≥0.05 in HapMap panels were analyzed in Cebu Filipino samples. SNPs were matched for the reference allele between all HapMap panels and Cebu Filipino samples. Fisher’s exact tests were used to test for allele frequency differences between pairs of samples. Pairwise LD (r²) values were calculated using Haploview (Barrett et al. 2005; http://www.broad.mit.edu/mpg/haploview) for adjacent pairs and all pairs of SNPs in each region. Haplotype blocks were defined in Haploview for each HapMap panel based on the default block definition (Gabriel et al. 2002). Identical blocks from each HapMap panel were defined in the Cebu Filipino samples for comparison. Using Haploview, haplotype frequencies were estimated in each haplotype block for every population. Haplotypes with a frequency >0.01 were evaluated in the HapMap panels. Haplotypes not observed in the Cebu Filipino samples were assigned a frequency of zero. Spearman’s correlation coefficients were calculated for all comparisons between Cebu Filipino samples and HapMap panels.
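For reference, the pairwise LD measure r² is the squared correlation between alleles at two SNPs. The sketch below computes it from phased two-SNP haplotypes coded 0/1. This is a simplifying assumption made for illustration: Haploview estimates haplotype frequencies from unphased genotypes, and the function name and input layout here are hypothetical.

```python
def r_squared(haplotypes):
    """Pairwise LD (r^2) between two biallelic SNPs.

    `haplotypes` is a list of (allele_at_snp1, allele_at_snp2) tuples,
    one per chromosome, with alleles coded 0 or 1.
    Returns None if either SNP is monomorphic (r^2 is undefined).
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n       # allele 1 frequency, SNP 1
    p_b = sum(b for _, b in haplotypes) / n       # allele 1 frequency, SNP 2
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                          # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return None
    return d * d / denom


# Two perfectly correlated SNPs on ten hypothetical chromosomes.
perfect = [(1, 1)] * 5 + [(0, 0)] * 5
# Two SNPs whose alleles are paired at random.
independent = [(1, 1), (1, 0), (0, 1), (0, 0)]
```

Under these definitions `r_squared(perfect)` is 1 and `r_squared(independent)` is 0, the two extremes between which the estimates compared in this study fall.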

To evaluate the efficiency of HapMap to choose tag SNPs for Cebu Filipino samples, tag SNPs from HapMap panels were selected using Tagger in pairwise tagging mode with other settings at default values (de Bakker et al. 2005; http://www.broad.mit.edu/mpg/tagger/). Several r² thresholds were used to assess the performance of selecting tag SNPs using the HapMap panels: 0.80, 0.85, 0.90, and 0.95. If a Cebu Filipino SNP exhibited pairwise r² ≥0.80 with at least one tag SNP (selected from the Asian HapMap panels), then the SNP was defined as captured in the Cebu Filipino sample. Percent coverage for a region is defined as the number of captured SNPs in the Cebu Filipino samples divided by the total number of SNPs (with estimated MAF ≥0.05). Finally, for each Cebu Filipino SNP, the maximum r² estimate obtained over all r² estimates between that SNP and a tag SNP in the region was identified. For a region, mean maximum r² was defined as the average value of the maximum r² values obtained over all Cebu Filipino SNPs in the region.
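The two summary measures defined above, percent coverage and mean maximum r², can be sketched as follows. The data layout (a dictionary of pairwise r² values keyed by unordered SNP pairs) and the function names are assumptions made for illustration. Two further assumptions are labeled in the code: a tag SNP is treated as capturing itself with r² = 1, and pairs missing from the dictionary are treated as r² = 0.

```python
def max_r2_per_snp(snps, tags, r2):
    """Maximum r^2 between each SNP and any tag SNP in the region."""
    best = {}
    for snp in snps:
        value = 0.0
        for tag in tags:
            # Assumption: a genotyped tag SNP captures itself (r^2 = 1);
            # pairs absent from `r2` are treated as r^2 = 0.
            pair_r2 = 1.0 if snp == tag else r2.get(frozenset((snp, tag)), 0.0)
            value = max(value, pair_r2)
        best[snp] = value
    return best


def coverage_summary(snps, tags, r2, capture_threshold=0.80):
    """Return (percent coverage, mean maximum r^2) for one region."""
    best = max_r2_per_snp(snps, tags, r2)
    captured = [s for s in snps if best[s] >= capture_threshold]
    percent_coverage = 100.0 * len(captured) / len(snps)
    mean_max_r2 = sum(best.values()) / len(snps)
    return percent_coverage, mean_max_r2


# Hypothetical region with four SNPs and one tag SNP.
snps = ["s1", "s2", "s3", "s4"]
tags = ["s1"]
r2 = {frozenset(("s1", "s2")): 0.9, frozenset(("s1", "s3")): 0.5}
percent_coverage, mean_max_r2 = coverage_summary(snps, tags, r2)
```

In this toy region, s1 and s2 are captured at the 0.80 threshold while s3 and s4 are not, so coverage is 50% and the mean maximum r² is the average of 1.0, 0.9, 0.5, and 0.0.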

Re-sequencing

Twenty-four randomly chosen Cebu Filipino samples were re-sequenced in the central 800 nucleotide (nt) region within each of the 40-kb ENCODE regions. Primers were selected using Primer3 software (Rozen and Skaletsky 2000; http://primer3.sourceforge.net), and sequences were compared using Sequencher 4.2.2 (Gene Codes Corporation, Ann Arbor, MI). Sequencing was performed at the University of North Carolina, Chapel Hill, automated DNA sequencing facility on an ABI Prism 3730 (Applied Biosystems, Foster City, CA) using the Big Dye Terminator Kit.

Results

To determine the extent of similarity between 80 Cebu Filipino samples and HapMap samples, genotype data for 627 SNPs located within the ten HapMap ENCODE regions were used (Table 1).

Allele frequencies

Allele frequency estimates were compared using SNPs with MAF ≥0.05 in the corresponding HapMap panel. A total of 399 SNPs were evaluated when examining CHB, 391 SNPs for JPT, 396 SNPs for CHB+JPT, 431 SNPs for CEU, and 391 SNPs for YRI. The Spearman’s correlation coefficients for allele frequency estimates between the Cebu Filipino samples and the HapMap panels were 0.96, 0.92, 0.95, 0.82, and 0.65 for CHB, JPT, CHB+JPT, CEU, and YRI, respectively (Fig. 1). For comparison, the Spearman’s correlation coefficient for allele frequency estimates between CHB and JPT samples was 0.95 for 384 SNPs with MAF ≥0.05 in both panels. The percent of SNPs with significantly different allele frequencies (Fisher’s exact p-value < 0.01) was 5.7% for CHB, 15.6% for JPT, 11.6% for CHB + JPT, 57.7% for CEU, and 60.1% for YRI. Although larger sample sizes should provide greater power to detect statistically significant differences, the 89 CHB + JPT samples showed fewer significant differences with the Cebu Filipino samples than the smaller JPT, CEU, or YRI groups. The allele frequency comparison was repeated using HapMap SNPs with MAF >0 and slightly higher Spearman’s correlations were obtained with analogous patterns of similarity (data not shown).
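The allele frequency comparison for a single SNP can be sketched as a two-sided Fisher's exact test on a 2 × 2 table of allele counts. This is a minimal illustration using only the standard library; the function name and the example counts are hypothetical, and the two-sided p-value is computed by summing the probabilities of all tables that are no more likely than the observed one, which is one common convention.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2 x 2 table [[a, b], [c, d]].

    Rows are samples (e.g. Cebu Filipino, HapMap panel); columns are
    counts of the reference and alternate allele.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance guards against floating-point ties.
    p_value = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


# Hypothetical allele counts: 160 chromosomes in one sample, 90 in another.
p_value = fisher_exact_two_sided(120, 40, 45, 45)
significantly_different = p_value < 0.01
```

Applied SNP by SNP, the share of tests with p < 0.01 gives a percentage comparable in form to those reported above.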

Fig. 1 Comparison of allele frequency estimates between Cebu Filipino samples and HapMap samples for SNPs with MAF ≥0.05 in the HapMap sample. Open symbols indicate SNPs with significantly different allele frequencies at a Fisher’s exact p-value <0.01

Based on the substantially greater similarity in allele frequencies between Cebu Filipino samples and Asian HapMap panels compared to CEU or YRI panels, subsequent analyses were performed using only the HapMap CHB, JPT, and CHB + JPT panels.

Linkage disequilibrium

Pairwise r² for adjacent pairs and all pairs of SNPs within each HapMap ENCODE region in the Asian HapMap samples and Cebu Filipino samples were estimated to evaluate the extent of LD in each population. Only SNPs with MAF ≥0.05 in the corresponding HapMap sample were included in comparisons. Analysis was performed for 375, 368, and 373 adjacent pairs of SNPs and 9,350, 8,912, and 9,157 total pairs of SNPs for CHB, JPT, and CHB + JPT, respectively. The Spearman’s correlation coefficients of the r² estimates for adjacent pairs of SNPs between the Cebu Filipino and Asian HapMap samples were 0.90 for each of CHB, JPT, and CHB + JPT. The Spearman’s correlation coefficients of the r² estimates for all pairs were 0.88 for CHB, 0.87 for JPT, and 0.89 for CHB + JPT (Table 2). The absolute difference between r² estimates of adjacent SNP pairs was also calculated. For CHB, 51, 73, and 85% of the adjacent SNP pairs had absolute differences between r² estimates of ≤0.05, ≤0.10, and ≤0.15, respectively; for JPT, the corresponding values were 48, 67, and 80%; and for CHB + JPT, they were 50, 69, and 83%.
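The two summaries used in this comparison, a Spearman's rank correlation between paired r² estimates and the proportion of pairs whose estimates differ by no more than a given amount, can be sketched as below. The implementation uses average ranks for ties and only the standard library; function names and the toy values are hypothetical, not study data.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


def fraction_within(x, y, tolerance):
    """Share of pairs whose absolute difference is <= tolerance."""
    return sum(abs(a - b) <= tolerance for a, b in zip(x, y)) / len(x)


# Hypothetical r^2 estimates for three adjacent SNP pairs in two samples.
r2_panel = [0.10, 0.50, 0.90]
r2_cebu = [0.12, 0.70, 0.90]
rho = spearman(r2_panel, r2_cebu)
share_close = fraction_within(r2_panel, r2_cebu, 0.05)
```

A rank correlation can be high even when individual estimates differ noticeably, which is why the absolute-difference summary is a useful complement.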

Table 2 Spearman’s correlation coefficients of all pairwise r² estimates between HapMap Asian panels and Cebu Filipino samples

When each of the ten regions was analyzed separately, LD differed both among regions and populations. Region ENr232 varied the most between HapMap panels; Spearman’s correlation coefficients of the r² estimates, for all pairs of SNPs, between the Cebu Filipino and Asian HapMap samples were 0.85 for CHB, 0.63 for JPT, and 0.77 for CHB + JPT. This region, however, did not differ from the other regions in allele frequency estimates, haplotype frequency estimates (below), or tag SNP analyses (below). The pairwise r² analysis was repeated using all HapMap SNPs with MAF >0 and yielded slightly higher Spearman’s correlations, but analogous patterns of similarity (data not shown). To confirm that Cebu Filipino sample size did not impact results, the analysis was repeated with three random sets of 45 Cebu Filipino samples. The sets of 45 were compared to CHB and JPT panels, and the results were similar to those observed for the total set of 80 Cebu Filipino samples (data not shown). On average across all regions, Cebu Filipino samples show highly similar patterns of LD compared to all Asian panels, with slightly more similarity observed with CHB + JPT panels and slightly less observed with JPT panels.

Haplotype frequencies

Haplotype frequencies for the Asian HapMap panels and Cebu Filipino samples were estimated for haplotypes composed of SNPs with MAF ≥0.05 in the HapMap panel. Haplotype blocks were defined using the default block definition used in Haploview (Gabriel et al. 2002). Within the ten regions, the average number of blocks per region was 3.6, 3.6, and 3.3 for CHB, JPT, and CHB + JPT, respectively. The blocks ranged in size from 2 to 65 SNPs with an average of 9.7, 10.9, and 9.1 SNPs per block for CHB, JPT, and CHB + JPT, respectively. One hundred seventy-eight, 151, and 141 haplotypes were identified with frequency estimates >0.01 in CHB, JPT, and CHB + JPT, respectively. The Spearman’s correlation coefficient of haplotype frequency estimates between Cebu Filipino and Asian HapMap samples was 0.95 for CHB, 0.88 for JPT, and 0.92 for CHB + JPT (Fig. 2). Most haplotypes with an estimated frequency >0 in the Asian samples were also observed (with estimated frequency >0) in Cebu Filipino samples, demonstrating a high degree of haplotype conservation across the populations. Of the observed haplotypes with estimated frequency ≥0.05 in CHB, JPT, and CHB + JPT, only 2.5% (3 of 119), 2.8% (3 of 107), and 1.8% (2 of 112), respectively, were not observed in Cebu Filipino samples. In addition, of the observed haplotypes with estimated frequency >0.01 in CHB, JPT, and CHB + JPT, 23% (41 of 178), 24% (36 of 151), and 11% (16 of 141), respectively, were not observed in Cebu Filipino samples. The greater representation of Cebu Filipino haplotypes in CHB + JPT samples is likely attributable to the larger sample size. Overall, the haplotype frequency differences were modest between Cebu Filipino samples and the Asian HapMap panels, with CHB showing the most similarity and JPT showing the least similarity.

Fig. 2 Comparison of haplotype frequency estimates between Cebu Filipino samples and Asian HapMap samples for SNPs with MAF ≥0.05 in the HapMap sample and haplotype frequency estimates >0.01 in the HapMap sample

Transferability of tag SNPs

To measure the efficiency of using the HapMap panels for tag SNP selection in the Cebu Filipino population, Tagger was used to select tag SNPs from the CHB, JPT, and CHB + JPT panels for SNPs with MAF ≥0.05. Tag SNP coverage was tested at four r² thresholds for selection in the HapMap panels, and the tag SNPs chosen in HapMap panels were applied to SNPs with MAF ≥0.05 in Cebu Filipino samples.
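To illustrate why the number of tag SNPs depends on the r² selection threshold, pairwise tag selection can be sketched as a generic greedy set-cover procedure: repeatedly pick the SNP that captures the most not-yet-captured SNPs at the chosen threshold. This is an illustrative sketch only and is not Tagger's implementation; the function name, the data layout, and the alphabetical tie-breaking rule are assumptions.

```python
def greedy_pairwise_tags(snps, r2, threshold=0.80):
    """Greedy pairwise tag SNP selection (generic sketch, not Tagger itself).

    `r2` maps frozenset({snp_a, snp_b}) to the pairwise r^2 in the
    reference panel; missing pairs are treated as r^2 = 0 (assumption).
    """
    def captures(tag):
        # A SNP always captures itself.
        return {
            s for s in snps
            if s == tag or r2.get(frozenset((s, tag)), 0.0) >= threshold
        }

    uncaptured = set(snps)
    tags = []
    while uncaptured:
        # Ties are broken alphabetically because max() keeps the first
        # best candidate of the sorted list.
        best = max(sorted(snps), key=lambda t: len(captures(t) & uncaptured))
        tags.append(best)
        uncaptured -= captures(best)
    return tags


# Hypothetical panel r^2 values for four SNPs.
panel_snps = ["a", "b", "c", "d"]
panel_r2 = {
    frozenset(("a", "b")): 0.90,
    frozenset(("b", "c")): 0.85,
    frozenset(("c", "d")): 0.30,
}
tags_080 = greedy_pairwise_tags(panel_snps, panel_r2, threshold=0.80)
tags_095 = greedy_pairwise_tags(panel_snps, panel_r2, threshold=0.95)
```

In this toy panel, two tag SNPs suffice at a threshold of 0.80, whereas at 0.95 no SNP captures any other and all four must be genotyped, mirroring the trade-off between threshold and genotyping burden.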

Overall, at each r² selection threshold using the CHB, JPT, and CHB + JPT panels, the percentages of SNPs captured (with a maximum r² ≥ 0.80) in the Cebu Filipino samples were very similar. Using any of the three panels for SNP selection, the lowest r² selection threshold of 0.80 captured 82–83% of Cebu Filipino SNPs (MAF ≥0.05) across all ten regions (Table 3). To obtain this percent coverage, 121, 118, and 125 tag SNPs from CHB, JPT, and CHB + JPT, respectively, would need to be genotyped. As expected, increasing the r² threshold for selecting tag SNPs in the Asian HapMap samples increased both the number of tag SNPs that needed to be genotyped and the proportion of Cebu Filipino SNPs captured by these tag SNPs. However, the percent coverage of each region varied substantially. At the r² selection threshold of 0.80, the percent coverage ranged over the ten regions from 54 to 96%, 59 to 96%, and 52 to 94% using CHB, JPT, and CHB + JPT tag SNPs, respectively. This variability between regions was still observed at an r² selection threshold of 0.95, the highest r² threshold studied. In addition, at each r² selection threshold, the mean maximum r² of all Cebu Filipino SNPs (MAF ≥0.05) was similar between CHB, JPT, and CHB + JPT. Among all SNPs, for the r² selection threshold of 0.80, a mean maximum r² of 0.88 was observed using CHB, JPT, and CHB + JPT tag SNPs. As expected, the mean maximum r² increased at each increase of the r² selection threshold. Little variability in mean maximum r² was observed between regions (data not shown).

Table 3 Coverage of the Cebu Filipino samples by tag SNPs selected from Asian HapMap panels

At each r² selection threshold, the SNPs in the Cebu Filipino samples that were not captured by a tag SNP selected in the HapMap panels were evaluated. The percentage of SNPs not captured and the mean maximum r² for each SNP were calculated (Table 3). Consistent with the sensitivity of tag SNP selection to allele frequency (Schulze et al. 2004), many of the SNPs not captured were rare (MAF <0.10). These rare SNPs had low mean maximum r² and were not captured using higher r² selection thresholds. Common SNPs (MAF ≥0.10) that were not captured at an r² of at least 0.80 were captured with a mean maximum r² of at least 0.65, 0.64, and 0.66 using CHB, JPT, and CHB + JPT tag SNPs, respectively. As the r² selection threshold increased, more of these Cebu Filipino common SNPs were captured with an r² of at least 0.80.

Re-sequencing

To assess the frequency of population-specific novel SNPs and to further evaluate the genetic structure in Cebu Filipinos, 24 Cebu Filipino individuals were re-sequenced in an 800-nt region within each of the ten HapMap ENCODE regions used previously for HapMap re-sequencing (The ENCODE Project Consortium 2004). Approximately 184 kb on at least one DNA strand were re-sequenced. Only one novel SNP was detected that was not present in HapMap (data release 21, July 2006) or dbSNP (build 126); the SNP was located in region ENr213 (ss69374772) and had a MAF of 0.05 in 80 Cebu Filipino individuals. Within Cebu Filipino samples, this SNP exhibited a maximum r² of 0.228 with four other SNPs in the 40-kb region.
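As a rough check on the scale of this re-sequencing, the sketch below computes the maximum sequence that the design could yield and the chance that a variant of a given frequency appears at least once among the re-sequenced chromosomes. The detection formula assumes that chromosomes are sampled independently and that every base is read successfully in every individual; both are simplifying assumptions introduced here, and the function names are hypothetical.

```python
def max_resequenced_kb(n_individuals, n_regions, region_nt):
    """Upper bound on total sequence (kb) if every read succeeds."""
    return n_individuals * n_regions * region_nt / 1000.0


def detection_probability(maf, n_individuals):
    """Chance that an allele of frequency `maf` is seen at least once.

    Assumes 2 * n_individuals independently sampled chromosomes.
    """
    n_chromosomes = 2 * n_individuals
    return 1.0 - (1.0 - maf) ** n_chromosomes


upper_bound_kb = max_resequenced_kb(24, 10, 800)   # design described above
p_detect_common = detection_probability(0.05, 24)  # allele with MAF 0.05
```

Under these assumptions the design has an upper bound of 192 kb, so the approximately 184 kb obtained represents most of the targeted sequence, and an allele with MAF 0.05 would be expected to appear in the 24 individuals with a probability of roughly 0.91.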

Discussion

The extent of similarity between Cebu Filipino samples and the previously evaluated HapMap samples was examined using allele frequency estimates, pairwise r² estimates, and haplotype frequency estimates. Consistent with population migration, mitochondrial DNA, and Y haplotype patterns (Jin and Su 2000), CEU and YRI samples were much less similar to Cebu Filipino samples with respect to allele frequency than CHB, JPT, or CHB + JPT samples. All of the analyses showed high similarity between Asian HapMap samples and Cebu Filipino samples.

Because the CHB and JPT samples have similar allele frequencies, these data sets are often combined for analyses (The International HapMap Consortium 2005). The existence of these two Asian HapMap panels allowed for evaluating the choice of using CHB, JPT, or the larger combined CHB + JPT panel as a resource for choosing haplotype tagging SNPs for Cebu Filipino samples. Among these three panels, JPT samples were the least correlated with Cebu Filipino samples with respect to allele frequency estimates, pairwise r² estimates, and haplotype frequency estimates. Cebu Filipino and CHB allele frequency estimates were more closely correlated than CHB and JPT allele frequency estimates. Both CHB and CHB + JPT panels were very similar to Cebu Filipino samples, and it is not clear which panel would act most efficiently as a proxy for the Cebu Filipino samples. The larger CHB + JPT sample size would be expected to decrease the variability in the allele and haplotype frequency estimates; however, the added JPT samples could decrease accuracy. Indeed, estimated Cebu Filipino allele and haplotype frequencies were slightly more correlated with CHB than with CHB + JPT, but Cebu Filipino pairwise r² estimates were slightly more similar to CHB + JPT than to CHB.

A practical use of HapMap is to select tag SNPs for regional or genome-wide association studies (The International HapMap Consortium 2005). Evaluation was performed on the transferability of HapMap tag SNPs chosen using the data from CHB, JPT, and CHB + JPT panels at several r² selection thresholds, with respect to capturing the genetic variability in samples from Cebu, Philippines. Using these criteria, at an r² selection threshold of 0.80, the HapMap-based tag SNPs capture 82–83% of the Cebu Filipino SNPs. A majority of the most common SNPs (MAF ≥0.10) in the Cebu Filipino sample that are not captured by the tag SNPs at an r² of at least 0.80 are captured with an r² of at least 0.60. Using higher r² thresholds for tag SNP selection in the HapMap samples results in capturing more SNPs in the Cebu Filipino sample, but with the added cost of genotyping more tag SNPs. Increasing the r² threshold failed to capture substantially more rare SNPs, most of which exhibited low pairwise LD with other SNPs.

Previously, de Bakker et al. (2006) showed through extensive SNP discovery and simulations that power to detect disequilibrium-based association is only modestly compromised when an appropriate selection of tag SNPs is chosen from HapMap samples and applied to other case-control samples. Large-scale SNP discovery and power simulations were beyond the scope of this study. However, based on the findings from de Bakker et al. (2006) and the current findings that tag SNPs selected using the Asian HapMap adequately captured common Cebu Filipino SNPs, the average loss in power to detect common causal alleles should be small.

Re-sequencing and genotyping were performed in the ten HapMap ENCODE regions that were re-sequenced for SNP discovery and are considered to be a gold standard because of the high density of SNP coverage (The ENCODE Project Consortium 2004). Only one SNP (estimated MAF = 0.05) was detected in the Cebu Filipino samples that was not observed in dbSNP or HapMap, suggesting that alleles ascertained from the HapMap ENCODE regions were representative of the common variation in Cebu Filipinos and that additional re-sequencing of these regions would not be required to detect common SNPs in Cebu Filipino samples. While future SNP selection in genome regions that have not been re-sequenced will be based on less complete SNP identification, the Asian HapMap panels will likely either include or tag most of the common SNPs present in Cebu Filipino samples.

Measures of LD, gene density, and haplotype blocks vary across the genome (Ke et al. 2004; De la Vega et al. 2005), and the HapMap ENCODE regions analyzed represent a range of these and other characteristics (The ENCODE Project Consortium 2004), suggesting that our results may apply, on average, across the genome. The strong correlations observed between Cebu Filipino samples and HapMap Asian panels are broadly consistent with other assessments of tagging transferability outside the HapMap ENCODE regions (Weale et al. 2003; Nejentsev et al. 2004; Ke et al. 2005; Mueller et al. 2005; Ramirez-Soriano et al. 2005; Evans and Cardon 2005; Gonzalez-Neira et al. 2006; Huang et al. 2006; Mahasirimongkol et al. 2006; Montpetit et al. 2006; Ribas et al. 2006; Willer et al. 2006).

Our results are consistent with previous studies that compared the Asian HapMap panels to other Eastern Asian samples. Studies that examined many populations worldwide found Asian and Oceania populations to be most similar to the Asian HapMap panel tested (Conrad et al. 2006; Gonzalez-Neira et al. 2006). Two studies have investigated the tagging transferability between the HapMap CHB, JPT, and CHB + JPT panels and sample sets from Thailand and from Korea (Mahasirimongkol et al. 2006; Yoo et al. 2006). A combination of tag SNPs from CHB + JPT best captured the LD structure of the Thai samples, while SNP selection based on JPT was most transferable to the Korean samples. In comparison, our results suggest that CHB samples and the combined CHB + JPT samples are most similar to Cebu Filipino samples, although our results do not necessarily reflect the patterns of genetic variability across the Philippines. Our findings will be useful for the future design and analysis of genetic studies in the Cebu Filipino population.