SNP discovery is one of the hottest topics in human genetics. But just how accurate are existing SNP collections, and what strategies should be used to discover enough SNPs for effective association studies and the Haplotype Map project? Two new studies published in Nature Genetics show that our existing collections are of high quality, but that in future we would be well advised to collect more so-called 'double-hit' SNPs (those for which both alleles have been detected at least twice) and to sample more ethnically varied populations.

Some 2.7 million uniquely mapped SNPs are currently available in public databases. But is that enough? Carlson and colleagues resequenced 50 genes in samples from European Americans and African Americans to assess the power of the available SNPs to detect variants that confer a risk of developing common diseases. Although 52% of SNPs were common to both populations in this study, a substantial proportion of common SNPs were private or common only in one population, indicating that different populations must be used in SNP discovery if the results of association studies are to be applicable across all human populations.

Looking at the SNP databases, they found that the number of database SNPs that were polymorphic was equal in their two study populations. Given that African populations are more variable, this indicates a European bias among database SNPs.

The power of a SNP collection to detect risk variants is related to the strength of linkage disequilibrium (LD) between unassayed and assayed markers. In their two populations, the authors found that if all 2.7 million SNPs were assayed, 77% of common SNPs would be ascertained in European Americans but only 50% in African Americans. This indicates that the present SNP collection would be more effective in association studies of European Americans than of African Americans.
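To make the idea of an assayed marker standing in for an unassayed one concrete, here is a minimal sketch that computes r², one common measure of LD, between two biallelic SNPs. It is not the authors' method: the 0/1 allele coding, the function names and the 0.8 threshold are all illustrative assumptions, and the input is assumed to be phased chromosomes listed in the same order for every SNP.

```python
def r_squared(snp_a, snp_b):
    """Return r^2 between two biallelic SNPs.

    snp_a, snp_b: equal-length, non-empty lists with one entry per phased
    chromosome, alleles coded 0 or 1 (a hypothetical encoding).
    """
    n = len(snp_a)
    p_a = sum(snp_a) / n                      # frequency of allele 1 at SNP A
    p_b = sum(snp_b) / n                      # frequency of allele 1 at SNP B
    # frequency of chromosomes that carry allele 1 at both SNPs
    p_ab = sum(1 for a, b in zip(snp_a, snp_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                      # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic in the sample")
    return d * d / denom


def is_captured(unassayed, assayed_snps, threshold=0.8):
    """True if any assayed SNP reaches the r^2 threshold with the unassayed one.

    The default threshold is an illustrative assumption, not a value taken
    from either study.
    """
    return any(r_squared(unassayed, s) >= threshold for s in assayed_snps)
```

Under this sketch, a population with weaker LD would have fewer unassayed SNPs reaching the threshold with any assayed marker, which is the pattern the authors report for African Americans.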

So, the authors propose that most of the SNPs that are required to assemble a comprehensive map for the European American population have been discovered. Considerable extra effort is, however, required for other populations. Clearly, assaying all 2.7 million SNPs is not feasible. But, although a several-fold smaller subset can capture the same information, defining this subset, which is crucial for the success of association studies, will require more studies of allele frequencies and LD relationships for all SNPs in the databases.

Reich et al. sequenced segments from 17 loci in European Americans and West Africans. Comparing their results with the public and Celera SNP databases to assess the quality and completeness of those databases, they found that 6–12% of the database SNPs were not valid. Carlson and colleagues put this figure even higher, at 28–35%, but this might reflect peculiarities of the regions that the two groups studied.

Reich and colleagues also assessed the proportion of all genetic variation in their studied regions that was attributable to SNPs already found in the databases. They found that >50% of SNPs that are informative for haplotype mapping (that is, with a minor-allele frequency of >10%) are already present in the databases, but that over one-third of the database SNPs have low minor-allele frequencies or are false positives. So, considerable effort could be spent on genotyping SNPs that are not very informative.
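The minor-allele frequency criterion can be written down in a few lines. The sketch below is only an illustration: the function names and the use of raw allele counts as input are assumptions, and the 10% cutoff is the one quoted in the text.

```python
def minor_allele_frequency(count_allele_1, count_allele_2):
    """Frequency of the rarer allele, given the counts of the two alleles."""
    total = count_allele_1 + count_allele_2
    if total == 0:
        raise ValueError("no chromosomes were sampled")
    return min(count_allele_1, count_allele_2) / total


def is_informative(count_allele_1, count_allele_2, cutoff=0.10):
    """True if the minor-allele frequency exceeds the cutoff (>10% in the text)."""
    return minor_allele_frequency(count_allele_1, count_allele_2) > cutoff
```

For example, a SNP seen as 12 copies of one allele and 88 of the other among 100 chromosomes would pass the cutoff, whereas one seen as 5 and 95 would not.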

Reich and colleagues propose that future efforts should focus on double-hit SNPs. As 99% of the double-hit SNPs that were examined in their study were validated, and 91% had a minor-allele frequency of >10%, this seems like sound advice. Although both groups agree that the existing SNP collection is of high quality, this is unlikely to be the last we hear of the great SNP debate.
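The double-hit criterion, as defined at the start of this article (both alleles detected at least twice), could be checked as in the sketch below. The input format, a sequence of allele calls from the discovery chromosomes, and the function name are hypothetical; how the studies themselves counted independent observations is not specified here.

```python
from collections import Counter


def is_double_hit(observed_alleles, min_hits=2):
    """True if exactly two alleles were seen and each was seen at least min_hits times.

    observed_alleles: iterable of allele calls, one per discovery chromosome,
    for example "AAGG" or ["A", "A", "G", "G"] (a hypothetical input format).
    """
    counts = Counter(observed_alleles)
    return len(counts) == 2 and all(c >= min_hits for c in counts.values())
```

A candidate whose minor allele was seen only once would fail this check, which is consistent with the intuition that single observations are more likely to be rare variants or sequencing errors.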