Introduction

Both linkage and association studies are employed to investigate the genetic basis of complex diseases such as atherosclerotic cardiovascular disease. Because association mapping is thought to have greater power to identify genetic determinants of complex diseases,1 this study design is increasingly used. Commonly, association studies test putative functional single nucleotide polymorphisms (SNPs) within candidate genes and regions, the so-called ‘direct’ approach.2 An alternative is the ‘indirect’ approach (ie, a linkage disequilibrium (LD)-based approach), in which a subset of markers in a region of interest is selected from small panels of subjects and then used in large-scale association analyses.

The concept of LD is key to designing ‘indirect’ association studies for complex diseases. Regions with extensive LD, ie, haplotype blocks, have been found interspersed with regions of medium and low LD in the human genome.3, 4 One way to reduce genotyping effort for association mapping of complex diseases is the use of haplotype tagging SNPs (htSNPs) or tagging SNPs (tSNPs). The two terms, htSNPs and tSNPs, refer to two different strategies for choosing the optimal minimum subset of SNPs from the entire set of SNPs. htSNPs are selected based on the haplotype-block model of the LD pattern in a region of interest and represent the common haplotypes inferred from the original set of SNPs.5 In contrast, tSNPs are selected based on measures of association, such that a tSNP predicts partially or completely the state of other SNPs.6

Thus far, several methods to select htSNPs or tSNPs have been described, and these can be broadly classed into four categories. The first category comprises methods based on defining how well a subset of SNPs captures the variation in the complete set.5, 6, 7, 8, 9, 10, 11 The second category is based on principal component analysis (PCA) to reduce the dimensions of complete sets of SNPs.12, 13 The third category is based on association or correlation between SNPs (ie, LD).14, 15 The fourth category includes several methods, an example being a method based on set theory that recursively searches for the minimal set of SNPs with a given function.16, 17 These four categories can also be grouped into haplotype-block-based methods5, 7, 8, 9, 12 and haplotype-block-free methods.6, 11, 15 The tSNPs derived from these two classes give different genome coverage because of the varying ‘blockiness’ in the human genome.18 In this paper, we use the term tSNPs to represent an ‘optimal’ selection of a subset of SNPs from the original set of SNPs, identified using either haplotype-block-based or haplotype-block-free methods.

A recent review19 described the methodological and conceptual differences between the available tagging algorithms. However, no systematic comparison of the available methods for selecting tSNPs has been performed, and a consensus method for choosing tSNPs has not been established. Researchers therefore have little guidance on which method for choosing tSNPs is most appropriate for a particular candidate-gene-based association study. We compared several leading tSNPs selection methods in 10 representative gene regions using resequenced genotype data (pga.mbt.washington.edu). We assessed the tagging efficiency (TE) and prediction accuracy of the tSNPs derived by these methods. In addition, we investigated the impact of minor allele frequency (MAF) cutoff, tagging criteria, and LD level on tSNPs selection, as well as the tagging consistency between different methods.

Materials and methods

Gene selection

Sequence data for 87 candidate genes for atherosclerotic cardiovascular diseases from 23 European-Americans were downloaded from the SeattleSNPs database (http://pga.mbt.washington.edu) on March 11, 2005.15, 20 Ten representative genes (Table 1) were selected for comparing different tSNPs selection methods based on the following criteria: (1) the LD level varied from strong LD (D′>0.8), to moderate LD (0.4<D′≤0.8), to weak LD (D′≤0.4),21 with D′ calculated using LDA software22 and the level of LD assessed using sliding-window plots of average D′ in each gene (Figure 1); and (2) the length of the genes was close to the mean length of the 87 genes (mean±SD of sequenced length in the 87 candidate genes=21.3±13.4 kb and median=17.7 kb).

Table 1 LD level, number of SNPs, and singletons in 10 representative genes
Figure 1 Sliding-window plots of the average LD measure (D′) in the 10 genes included in the present study. Average D′ was calculated from all SNP pairs in 5 kb sliding windows (1 kb increment between windows starting from 2.5 kb). If there were no SNPs in a given window, the D′ value was recorded as ‘not available’. Different LD levels were present in different gene regions: strong LD (D′>0.8), moderate LD (0.4<D′≤0.8) and weak LD (D′≤0.4).
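The sliding-window averaging described for Figure 1 can be illustrated with a minimal Python sketch. It is not the LDA software used in the study. It assumes phased, biallelic haplotypes coded 0/1, SNP positions given in base pairs relative to the start of the region, and windows identified by their centers; the function names `d_prime` and `sliding_window_d_prime` are hypothetical.

```python
from itertools import combinations


def d_prime(hap_a, hap_b):
    """Return |D'| for two biallelic SNPs, or None if either is monomorphic.

    hap_a and hap_b hold the 0/1 allele carried by each chromosome.
    """
    n = len(hap_a)
    p_a = sum(hap_a) / n
    p_b = sum(hap_b) / n
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    if d_max == 0:
        return None
    return abs(d) / d_max


def sliding_window_d_prime(positions, haplotypes, window=5000, step=1000):
    """Average |D'| over all SNP pairs in each window.

    positions: sorted SNP coordinates in bp (assumed relative to region start).
    haplotypes[i]: 0/1 alleles at SNP i across chromosomes.
    Returns (window_center, mean |D'|) tuples; the mean is None when a
    window contains no informative SNP pair ('not available').
    """
    half = window // 2
    center = half  # first window is centered at 2.5 kb for a 5 kb window
    results = []
    while center - half <= positions[-1]:
        idx = [i for i, pos in enumerate(positions)
               if center - half <= pos < center + half]
        values = [d_prime(haplotypes[i], haplotypes[j])
                  for i, j in combinations(idx, 2)]
        values = [v for v in values if v is not None]
        results.append((center, sum(values) / len(values) if values else None))
        center += step
    return results
```

Returning None for empty windows mirrors the ‘not available’ convention of the figure; whether LDA treats window boundaries as half-open intervals is an assumption of this sketch.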

Methods for the selection of tSNPs

We compared eight published methods of identifying tSNPs. Most of the methods are based on searches to evaluate subsets of SNPs using different measures and include All common haplotypes,7, 8 Haplotype diversity,5 Coefficient of determination (R2h),10 Haplotype entropy (Entropy),23 and Haplotype r2 (TagIT).11 Another set of methods is based on PCA, for example, the method described by Lin and Altman.13 Carlson et al15 developed the method LD r2 (based on pairwise LD), in which the maximally informative site and all associated sites are grouped into a bin. Sebastiani et al17 described a method (BEST) in which all optimum fully informative tSNPs are generated based on set theory. A brief summary of each method, including measures or statistics, comments and original references, is presented in Supplementary Table 1.
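To make the binning idea behind the LD r2 method concrete, the following is a simplified Python sketch of a greedy procedure in which the site associated with the most other sites is grouped with them into a bin. It is an illustration only, not the ldSelect program: it assumes a precomputed symmetric matrix of pairwise r2 values, reports a single tSNP per bin, and breaks ties by the lowest index. The function name `greedy_r2_bins` and the default threshold are hypothetical.

```python
def greedy_r2_bins(r2, threshold=0.8):
    """Greedily partition SNPs into bins of sites associated with one tag.

    r2: square, symmetric matrix (list of lists) of pairwise r2 values.
    threshold: r2 value at or above which two SNPs count as associated.
    Returns a list of (tag_index, sorted_member_indices) tuples.
    """
    remaining = set(range(len(r2)))
    bins = []
    while remaining:
        # Pick the SNP associated with the most other unbinned SNPs;
        # iterating over a sorted list makes ties resolve to the lowest index.
        best = max(
            sorted(remaining),
            key=lambda i: sum(
                1 for j in remaining if j != i and r2[i][j] >= threshold
            ),
        )
        members = {best} | {
            j for j in remaining if j != best and r2[best][j] >= threshold
        }
        bins.append((best, sorted(members)))
        remaining -= members
    return bins
```

For example, with three mutually associated SNPs and one isolated SNP, the sketch returns one three-member bin and one singleton bin, so two tSNPs would be genotyped.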

Haplotype-block definitions

The above eight methods can also be classified as haplotype-block-based methods (eg, All common haplotypes, Haplotype diversity, R2h, and Entropy) and haplotype-block-free methods (eg, TagIT, LD r2, PCA, and BEST). Haplotype blocks have been defined based on diversity,7, 8 LD,3 and recombination.24 Comparisons of haplotype blocks based on these definitions have revealed similarity between the LD-based method and the recombination-based method.25, 26, 27 We chose to define haplotype blocks based on LD when haplotype-block-based tSNPs selection methods were employed. The LD-based haplotype-block definition requires that SNP pairs with strong D′ (absolute D′≥0.70) account for at least 95% of all pairs of SNPs in the block.3
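The block criterion as stated above can be written as a short check. This minimal Python sketch follows only the wording of this section (at least 95% of SNP pairs with absolute D′≥0.70) and assumes a precomputed symmetric matrix of D′ values; it does not reproduce any confidence-interval step that the original LD-based definition or the Hapblock program may apply. The function name `is_ld_block` is hypothetical.

```python
from itertools import combinations


def is_ld_block(d_prime_matrix, strong=0.70, min_fraction=0.95):
    """Check whether a set of SNPs meets the LD-based block criterion.

    d_prime_matrix: square, symmetric matrix of pairwise D' values.
    strong: |D'| at or above which a pair counts as being in strong LD.
    min_fraction: minimum proportion of strong-LD pairs required.
    """
    pairs = list(combinations(range(len(d_prime_matrix)), 2))
    if not pairs:
        return False  # a single SNP has no pairs to evaluate
    n_strong = sum(
        1 for i, j in pairs if abs(d_prime_matrix[i][j]) >= strong
    )
    return n_strong / len(pairs) >= min_fraction
```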

Selection of tSNPs under different MAF and tagging criteria

The following programs were available from the authors' websites: Hapblock,28 ldSelect for the method LD r2,15 TagIT,11 and BEST.17 Hapblock integrates several methods, including All common haplotypes, Haplotype diversity, R2h, and Entropy. We implemented the PCA algorithm13 in MATLAB®, using the varimax rotation method to map tSNPs after eigenSNPs were mathematically selected.

We used tagging criteria as a measure for quantifying the proportion of variation captured by a tSNPs set. The tagging criteria of 0.70, 0.80, and 0.90 at MAFs of 0.10 and 0.20 were assessed in six methods, including Haplotype diversity,5 Entropy,23 R2h,10 LD r2,15 Haplotype r2 (TagIT),11, 29 and PCA.13 For each method, combinations of parameters of tagging criteria and MAF were input. For the remaining two methods, that is, All common haplotypes7, 8 and BEST,17 there was no need to input the tagging criteria. We compared TE, prediction accuracy, and tagging consistency between different tSNPs selection methods as described below.

Tagging efficiency (TE)

TE was defined based on Ke et al30 as

$$\mathrm{TE} = \frac{n}{n_h} \qquad (1)$$

where nh is the number of tSNPs and n is the total number of SNPs under a given MAF cutoff. TE provides an estimate of the savings in genotyping offered by tSNPs and is expected to vary with the MAF cutoff.
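As a worked illustration of how TE depends on the MAF cutoff, the following Python sketch counts the SNPs that pass a cutoff and divides that count by the number of tSNPs, the ratio used for TE in this study. It assumes 0/1-coded alleles per chromosome and that a SNP passes when its MAF is greater than or equal to the cutoff (the text does not say whether the comparison is inclusive); the function names are hypothetical.

```python
def minor_allele_frequency(alleles):
    """MAF of a biallelic SNP from 0/1 alleles across chromosomes."""
    freq = sum(alleles) / len(alleles)
    return min(freq, 1 - freq)


def tagging_efficiency(snp_alleles, tag_snps, maf_cutoff):
    """TE = n / nh for the SNPs passing the MAF cutoff.

    snp_alleles: dict mapping SNP name to its list of 0/1 alleles.
    tag_snps: names of the tSNPs chosen by some selection method.
    maf_cutoff: SNPs with MAF >= cutoff are counted (assumed inclusive).
    """
    eligible = {
        name for name, alleles in snp_alleles.items()
        if minor_allele_frequency(alleles) >= maf_cutoff
    }
    tags = [name for name in tag_snps if name in eligible]
    if not tags:
        raise ValueError("no tSNPs pass the MAF cutoff")
    return len(eligible) / len(tags)
```

With four SNPs of MAF 0.5, 0.2, 0.1, and 0.3 and a single tSNP, the sketch gives TE=4 at a cutoff of 0.10 and TE=3 at a cutoff of 0.20, because the rarest SNP is no longer counted.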

We also selected a 100 kb ENCODE region (ENCyclopedia Of DNA Elements, a project that aims to produce a dense set of genotypes across large genomic regions) on chromosome 7q21.13 (www.hapmap.org) to test the effect of sample size (ie, n=24, 48, 72, and 90) on TE.

Prediction accuracy

Halperin et al31 have proposed a measure of prediction accuracy to evaluate the quality of tSNPs and described its utility in selecting tSNPs given the genotype information of SNPs from a set of unrelated individuals. The measure aims to maximize the expected accuracy of predicting untyped SNPs, given the unphased information of the tSNPs.31 Formally, for a given set of SNPs t, the objective is to find a set of tSNPs S of size t and a prediction function f such that the prediction error is,

where ZS is the restriction of the genotype to the tSNPs positions, and g(j) is the j-th SNP in genotype g. We calculated the prediction accuracy of the sets of tSNPs generated by the eight methods under the two MAF cutoffs in the 10 genes using the program Gerbilview.32

Tagging consistency

Let set1 and set2 denote two sets of tSNPs derived either from one population using two different methods or from two populations using one method. To quantify the consistency or similarity between the sets of tSNPs, we used the methods of Schwartz et al27 and Liu et al33 that assess whether or not tSNPs from two different methods or two populations coincide. The P-value (P(set1, set2)) is from Fisher's exact test for the null hypothesis that the two tSNPs sets are independent.

where B1 and B2 are the numbers of tSNPs in set1 and set2, respectively, m is the number of tSNPs shared by set1 and set2, and L is the total number of SNPs in the regions under study. The consistency measure (C) is defined as the negative logarithm of the P(set1, set2) value:

$$C = -\log P(\mathrm{set1}, \mathrm{set2}) \qquad (4)$$
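The consistency calculation can be sketched in Python as follows. This is an illustration under two assumptions that the text does not confirm: that P(set1, set2) is the one-sided (upper-tail) hypergeometric probability of observing at least m shared tSNPs given B1, B2, and L, and that the logarithm is base 10. The function names are hypothetical.

```python
from math import comb, log10


def fisher_upper_tail(b1, b2, m, total):
    """P(at least m shared tSNPs) under independent selection.

    b1, b2: sizes of the two tSNP sets; m: number of shared tSNPs;
    total: number of SNPs in the region (L in the text).
    Assumes the one-sided hypergeometric form of Fisher's exact test.
    """
    upper = min(b1, b2)
    numerator = sum(
        comb(b1, i) * comb(total - b1, b2 - i) for i in range(m, upper + 1)
    )
    return numerator / comb(total, b2)


def tagging_consistency(set1, set2, total):
    """Return (P-value, C) for two sets of tSNP identifiers.

    C is the negative base-10 logarithm of the P-value (base assumed).
    """
    shared = len(set(set1) & set(set2))
    p_value = fisher_upper_tail(len(set1), len(set2), shared, total)
    return p_value, -log10(p_value)
```

For two sets of three tSNPs sharing two members in a region of 10 SNPs, the sketch gives P=22/120 (about 0.18), so the overlap would not count as evidence against independent selection at P<0.05.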

Results

The eight methods of tSNPs selection were compared in 10 genes (Table 1). These 10 genes had different levels of LD and their genomic length was close to the average length of the 87 candidate genes for atherosclerotic cardiovascular disease. The LD pattern of these genes for European-Americans is illustrated in Figure 1, indicating strong LD in DO, IL1A, MGP, and VKORC1, moderate LD in ALOX12, SELL, and VCAM1, and weak LD in F2, F10, and ICAM1.

Tagging efficiency

For each method of selecting tSNPs, we defined TE as the total number of markers in the region of interest divided by the number of tSNPs chosen by a particular method, based on Eq. (1). The TE of the eight tSNPs selection methods across different gene regions with different levels of LD, using two MAF cutoffs (0.10 and 0.20) and a tagging criterion of 0.90, is shown in Table 2 and Supplementary Figure 1. The mean TE varied (from ∼2 to ∼25) depending on the method of tSNPs selection and the LD level of the gene region. The overall TE was highest for Haplotype diversity and TagIT. As expected, the LD level in the gene regions affected TE; for most tSNPs selection methods, TE was higher in strong LD regions than in regions of moderate LD and weak LD. The variance in TE in the high LD regions was weakly related to the method of tSNPs selection (two-way analysis of variance (ANOVA), P=0.090) but not to MAF (P=0.998). In the moderate and low LD regions, the variance in TE could be attributed to the method used for tSNPs selection (P=0.028 and P=0.003 in moderate and low LD regions, respectively) as well as MAF (P=0.002 and P=0.009 in moderate and low LD regions, respectively).

Table 2 Tagging efficiency of the eight methods of tSNPs selection

To investigate whether TE was significantly affected by tagging criteria under the two MAF cutoffs, we performed ANOVA for six methods (no tagging criteria were input for the methods All common haplotypes and BEST; see Supplementary Table 1) under different MAF cutoffs and three different tagging criteria (0.70, 0.80, and 0.90). TE was significantly affected by tagging criteria for several methods, especially PCA, TagIT, Entropy, and Haplotype diversity, in regions of high LD (Supplementary Table 2). In regions of moderate and low LD, TE was significantly affected by tagging criteria for the methods PCA and TagIT.

We also tested the effect of sample size (ie, n=24, 48, 72, and 90) on TE in the 100 kb ENCODE region on chromosome 7q21.13. No significant difference in TE was noted at different sample sizes (Supplementary Figure 2).

Prediction accuracy of tSNPs

The prediction accuracy of tSNPs for gene regions with different LD levels, using the two MAF cutoffs (0.10 and 0.20) and a tagging criterion of 0.90, is shown in Table 3 and Supplementary Figure 3. The prediction accuracy of tSNPs was comparable among different methods in gene regions with different LD levels. If we set a prediction accuracy of 0.90 as a threshold in regions with high or moderate LD, the prediction accuracy under different MAF cutoffs exceeded or approached the threshold for all eight methods. There was no significant difference in prediction accuracy of tSNPs under the two MAF cutoffs in any of the eight methods. No significant differences in prediction accuracy were noted among different methods at various LD levels (P>0.05, two-way ANOVA). Thus, neither the choice of tSNPs selection method nor the level of LD affected the prediction accuracy.

Table 3 Prediction accuracy of the eight methods of tSNPs selection

Tagging consistency

In general, different methods for selecting tSNPs generated different sets of tSNPs. To quantify and examine the consistency between tSNPs generated by different methods of tSNPs selection, we used the tSNPs similarity measure in Eq. (4) and the test of significance in Eq. (3). We compared the results generated under a MAF cutoff of 0.10 and a tagging criterion of 0.90. There was greater similarity between tSNPs identified by the methods All common haplotypes, Entropy, R2h, and BEST than between the remaining methods. Supplementary Table 3 summarizes the statistical tests of the null hypothesis of independent tSNPs from pairwise comparison of the methods All common haplotypes, Entropy, R2h, and BEST. Most of the pairwise comparisons among these four methods provided evidence against the null hypothesis of independent SNP selection by different methods (P<0.05). When comparing regions with different levels of LD, we observed that the fraction of significant pairwise comparisons decreased with decreasing LD level. In the case of IL1A (high LD), six out of six pairwise comparisons were significant, compared with three out of six in the ALOX12 gene (moderate LD) and two out of six in the F2 gene (weak LD).

Computational cost

Finally, we compared the computational cost of each method of tSNPs selection. Five methods (All common haplotypes, Haplotype diversity, Entropy, LD r2, and R2h) were run under Linux with an AMD® Athlon 2800+MP CPU and the other three methods (BEST, PCA, and TagIT) were run under a Windows XP® system with a 2.8 GHz CPU. The computational cost was comparable at MAF 0.10 and 0.20 among the different methods. For example, in the ALOX12 gene, with MAF=0.10 and tagging criterion 0.90, R2h and Entropy took <10 min to produce results using the Hapblock program, whereas the other methods took <1 min. However, when comparing tSNPs selection methods in larger genomic regions (such as the 500 kb ENCODE regions), the computational burden for the methods R2h, Entropy, and BEST was significant. For example, the runtime for the program BEST grew exponentially and we were unable to obtain results even after 2 weeks for a 227 (marker number) by 180 (haplotype sample size) matrix on a Windows server (2 GHz CPU and 3.50 GB of RAM).

Discussion

In this paper, we compared tagging efficiency (TE), prediction accuracy, and tagging consistency of tSNPs generated from eight published methods of tSNPs selection. The comparisons were carried out using sequence data for 10 representative candidate genes for atherosclerotic cardiovascular disease with varying levels of LD in a sample of European-Americans (Figure 1, Table 1).

Several factors, including LD level, MAF, tagging criteria, and sample size, may affect TE. TE was affected significantly by the level of LD and was higher in genes with a higher level of LD. In high LD regions, the variance in TE was weakly related to the method of tSNPs selection but not to MAF. However, in moderate and low LD regions, TE was influenced by the method of tSNPs selection as well as the MAF cutoff. There appeared to be nearly an order of magnitude difference in TE between some of the methods (eg, a lower efficiency using LD r2 and a higher efficiency using TagIT for ALOX12) (Table 2 and Supplementary Figure 1). This difference may be due to long-range associations between SNPs. For example, LD may exist between bins, which were partitioned based on LD r2 (or between haplotype blocks, as in All common haplotypes), whereas TagIT is able to incorporate such long-range LD.3, 29 Tagging criteria influenced TE in the genes with strong or moderate LD levels, especially for the methods PCA, TagIT, and Haplotype diversity (Supplementary Table 2). As the tagging criterion increased from 0.70 to 0.80 to 0.90, a greater number of tSNPs was needed to tag the entire gene region for these three methods. As the SeattleSNPs investigators used a relatively small number of subjects (n=23 European-Americans) for SNP ascertainment, we evaluated whether the perceived TE was affected by larger sample sizes. No significant change in TE was noted using larger sample sizes (n ranged from 24 to 90) for each tSNPs selection method (Supplementary Figure 2).

The prediction accuracy of tSNPs selected by different methods approached or exceeded the threshold of 0.90 (Table 3 and Supplementary Figure 3). Neither the choice of tSNPs selection methods nor the level of LD significantly affected the prediction accuracy (two-way ANOVA, P>0.05). Given the higher TE of Haplotype diversity and TagIT, the prediction accuracy of these two methods was higher in the gene regions with high LD and comparable to other methods in the moderate and low LD regions.

To investigate whether TE and prediction accuracy were different in genes larger than the ones we initially studied (10 genes, 10–30 kb), we assessed TE and prediction accuracy in an additional five genes ranging in size from 30 to 50 kb. A similar pattern for TE and prediction accuracy among different methods was noted (Supplementary Figure 4).

We plotted prediction accuracy (on the X-axis) versus TE (on the Y-axis) for the 10 genes for each tSNPs selection method to assess the tradeoff between prediction accuracy and TE at various LD levels (Supplementary Figure 5). However, for a given method of tSNPs selection, no simple linear relationship between TE and prediction accuracy was obvious in the 10 genes. We also calculated the measure ‘tagging effectiveness’ described by Ke et al34 to assess the percent of hidden (untyped) SNPs with r2≥0.80 to the haplotypes defined by a tSNPs set (Figure 2). For all the eight tSNPs selection methods, tagging effectiveness in high LD regions was much higher than that in moderate and low LD regions. However, within the three levels of LD, tagging effectiveness was similar among the eight tSNPs selection methods (P>0.05 by ANOVA).

Figure 2 The percent of hidden (untyped) SNPs with r2≥0.80 to haplotypes defined by a tSNPs set under two MAF cutoffs.

Pairwise comparisons of tSNPs sets revealed poor consistency between tSNPs selected using any two of the eight methods. A limited degree of tagging consistency was present between tSNPs derived from four methods (All common haplotypes, Entropy, R2h, and BEST), three of these methods (All common haplotypes, Entropy, and R2h) being haplotype-block-based (Supplementary Table 3). This may be due to the low haplotype diversity in each block for the haplotype-block-based methods, and therefore a greater likelihood for similar tSNPs to be selected to represent common haplotypes using two different methods. Among haplotype-block-free methods, the underlying principles for choosing tSNPs are diverse; for example, TagIT incorporates all associations between SNPs along a region, whereas LD r2 considers association between SNPs in a bin (Supplementary Table 1). Thus, the tSNPs sets identified by haplotype-block-free methods differed considerably. We found little similarity between tSNPs sets generated from the remaining four methods (Haplotype diversity, LD r2, TagIT, PCA) (analyses not shown). Forton et al35 have suggested that haplotype reconstruction by tSNPs generated by haplotype-block-based methods is more accurate than that by haplotype-block-free methods.

The International HapMap project is meant to facilitate the optimal selection of SNPs for cost-effective and robust whole-genome association studies.36 Using the methods described above, we obtained tSNPs sets for the same 10 genes in 24 African-Americans using resequenced data from the SeattleSNPs database. Tagging consistency between European-Americans and African-Americans was measured using Eq. (3). We found that the tagging consistency between the two ethnic groups, or ‘tSNPs transferability’,37 was poor for all eight methods, indicating that the tSNPs sets selected for European-Americans are not transferable to African-Americans (analyses not shown). However, tSNPs may be transferable between different geographical samples of an ethnic group37 or between various non-African populations.38

At least two initiatives, SeattleSNPs15, 20 and the Environmental Genome Project (EGP),39 have resequenced several hundred candidate genes involved in inflammation and environmental response, to facilitate candidate-gene-based association studies. These two projects used small panels of subjects (n=23–30) belonging to different ethnic groups to characterize polymorphic variation and the pattern of LD in the candidate gene regions. In the present study, we used resequenced data from SeattleSNPs (n=23 European-Americans and 24 African-Americans). It has been estimated that 48 chromosomes would identify ∼99% of SNPs with a MAF≥5%.40 In a simulation study, Thompson et al41 found that using such a sample size (25 unphased individuals) to select tSNPs did not reduce the power of an association study, compared to using all SNPs.

Comparing various tSNPs selection methods is far from straightforward. First, selecting representative gene data sets for analysis is problematic because of different LD levels in different genes and the variability in the number of SNPs among genes. Second, the size of candidate genes and genomic regions to be studied could be much larger than the regions (50 kb maximum) investigated in the present study and the extent of LD could also extend well beyond this size. Third, there is no consensus on what are the most appropriate statistics to evaluate the performance of tSNPs sets. Each method for choosing tSNPs has its own quality measure to optimally select a set of tSNPs. The measure we used for evaluating the quality of tSNP selection – prediction accuracy – aims to maximize the expected accuracy of predicting untyped SNPs, given the unphased (genotype) information of the tSNPs.31 Fourth, there is no simple relationship between TE and accuracy that allows one to choose an optimal balance of these two measures. Recently, Ke et al34 used a matching TE among three tSNPs selection methods to assess ‘tagging effectiveness’. Generating a matching TE to compare prediction accuracy of eight tSNPs selection methods would require significant computational resources beyond the scope of the present study.

A major expectation from using tSNPs is that the genotyping cost is reduced while the statistical power for identifying associations is only minimally compromised. Statistical power may be an important metric in deciding which method is optimal in association studies. A direct comparison of tSNPs selection methods in the context of statistical power may be possible in a simulation study,42 but was outside the scope of the present study. Another expectation of tSNPs selection methods is flexibility, allowing one to force a specific SNP, for example, a nonsynonymous SNP, into the set of tSNPs. Some programs, such as Hapblock,28 allow insertion of a specific SNP into a tSNPs set. Flexibility would also allow one to replace a SNP that cannot be genotyped with an alternate tSNP, for example, an alternate SNP within the same bin (LD r2) or the same haplotype block (Haplotype diversity).

Except for LD r2, which uses genotype data to calculate the pairwise LD measure (r2), the methods for selecting tSNPs are based on haplotype data. We used haplotypes inferred from the PHASE program43 to generate the input for each method. Although convenient, statistical inference of haplotypes is associated with a degree of uncertainty, as a proportion of the inferred haplotypes may be incorrect. This may reduce the statistical power of a haplotype approach to detect an association with disease.43, 44 How the tSNPs selection methods compare when genotype data are used instead of haplotypes needs further study. The use of genotype data combined with a PL–EM (Partitioning-Ligation–Expectation-Maximization) algorithm for choosing tSNPs may be comparable to the use of haplotypes in association studies.9

A limitation of the present analyses is that there are moderate amounts of missing data in the 10 genes (missing data ranged from 1.9 to 8.2%). The PHASE program imputes missing data when haplotypes are constructed. How the missing data rate might affect tSNPs selection is unclear although Zhang et al9 found that the statistical power and the number of tSNPs with and without moderate missing data were similar, even with 10% data missing.

In conclusion, our comparison of the performance of several methods for choosing tSNPs revealed that TE varied with the methods, being highest for Haplotype Diversity5 and TagIT (haplotype r2).11 Because the prediction accuracy and the computational cost were similar among different methods, the methods Haplotype Diversity and TagIT may be considered initially for tSNPs selection. We found limited tagging consistency between tSNPs generated by different tSNPs selection methods, and tSNPs had limited transferability between African-Americans and European-Americans. This work demonstrates that when tSNPs-based association studies are undertaken, the choice of method for selecting tSNPs requires careful consideration.