In reply:

Walton et al. discuss two points regarding the selection of tagging SNPs (tSNPs) for genetic association studies. The first relates to the genomic boundaries within which tSNPs are selected. The computational constraints on multiple-marker methods of selecting tSNPs make it impractical to consider contiguous stretches of sequence beyond a certain size; beyond this, it is necessary to follow some scheme to subdivide the region. Walton et al. note that the division selected for one of our regions, the polygenic CYP2C region, will influence the tSNP efficiency (i.e., minimizing the number of tSNPs needed to achieve a desired level of power against untyped causal variants). Second, Walton et al. note that there may be long-range haplotypes extending across genes in the CYP2C cluster that may form functional units. As general principles, we agree with both these points. In fact, we were among the first to show that combining smaller regions into single large ones, where possible, led to considerable increases in SNP tagging efficiency (ref. 1). We also think that the occurrence of two or more genes in strong linkage disequilibrium (LD) could make it useful to consider intergene haplotypes as integrated units in association studies.

Both these areas are worthy of further evaluation. In our view, there has not yet been any thorough investigation of optimal approaches for subdividing regions in such a way as to maximize the performance of tSNPs. It is a difficult problem because of the inherent trade-off between tagging efficiency and the size of the region being tagged. The larger the region is, the more one can capitalize on LD (ref. 1), but the more difficult it becomes to infer haplotypes accurately. There is also a danger of overfitting with haplotype-based models, in which the degrees of freedom of the model can grow rapidly with the number of tSNPs required. There comes a point at which regions become too large to tag using aggressive multiple-marker methods. It is, therefore, uncertain how best to set break points between regions in such situations. For example, Walton et al. base their assessment on point estimates of D′. In contrast, the method of Gabriel et al. (ref. 2), in which the degree of confidence in a given D′ value is used in the assessment of an LD block, does not define an extended LD block covering both CYP2C19 and CYP2C9 (see Fig. 2 in ref. 3). There are, however, no a priori grounds for assuming that either of these approaches will result in a subdivision of the region that optimizes tagging performance (and almost certainly they do not). In the particular case of the CYP2C region, however, we note that using the subdivision suggested by Walton et al. results in slightly fewer tSNPs than does using the arbitrary subdivision that we used. Finally, we note that our study (ref. 3) used just one of many possible tSNP selection methods (based on haplotype r²), and the relative power of different methods for different genomic regions is a further unresolved issue.
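
To make the distinction between a point estimate of D′ and a confidence-based assessment concrete, the following is a minimal sketch in Python. It assumes phased haplotype counts for two biallelic SNPs and uses the standard definitions of D, D′ and r². The function names and the example counts are hypothetical, and the percentile bootstrap is a simple stand-in chosen for illustration; it is not the procedure of Gabriel et al. and not the method used in our study.

```python
import random


def ld_stats(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, D', r^2) for two biallelic SNPs from phased haplotype counts."""
    n = n_AB + n_Ab + n_aB + n_ab
    if n == 0:
        raise ValueError("no haplotypes supplied")
    p_A = (n_AB + n_Ab) / n          # frequency of allele A at SNP 1
    p_B = (n_AB + n_aB) / n          # frequency of allele B at SNP 2
    d = n_AB / n - p_A * p_B         # raw disequilibrium coefficient D
    var = p_A * (1 - p_A) * p_B * (1 - p_B)
    if var == 0:
        raise ValueError("both SNPs must be polymorphic")
    # D' rescales |D| by its maximum possible value given the allele frequencies.
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    return d, abs(d) / d_max, d * d / var


def bootstrap_d_prime_interval(counts, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for D' (illustrative assumption only)."""
    rng = random.Random(seed)
    n = sum(counts)
    values = []
    for _ in range(n_boot):
        # Resample n haplotypes with replacement from the observed counts.
        draw = rng.choices(range(4), weights=counts, k=n)
        resampled = [draw.count(i) for i in range(4)]
        try:
            values.append(ld_stats(*resampled)[1])
        except ValueError:
            continue                 # skip resamples in which a SNP is monomorphic
    if not values:
        raise ValueError("no usable bootstrap resamples")
    values.sort()
    lower = values[int((alpha / 2) * len(values))]
    upper = values[max(int((1 - alpha / 2) * len(values)) - 1, 0)]
    return lower, upper


# Hypothetical counts: one haplotype is absent, so D' = 1 while r^2 = 0.25
# (up to floating-point rounding).
d, d_prime, r2 = ld_stats(50, 0, 30, 20)

# Hypothetical counts with all four haplotypes present, for the interval.
lower, upper = bootstrap_d_prime_interval([48, 2, 30, 20])
```

In the first example D′ is at its maximum even though r² is modest, which is one reason why a subdivision based on point estimates of D′ and a tSNP selection based on haplotype r² need not agree. The interval in the second example shows how much uncertainty can sit behind a single D′ value.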

In summary, some method of subdividing regions in order to maximize tSNP efficiency would be useful, but the best method for doing this is currently not known. Similarly, consideration of longer-range functional haplotypes spanning more than one gene has not been given much attention in the tSNP literature. Both these areas are worthy of further consideration, and Walton et al. are right to call attention to them.