In reply:

The main point of our original paper (ref. 1) was that even the relatively small levels of structure in large populations cannot be ignored in the coming generation of association studies, effectively because of the sizes of these studies (both sample size and numbers of loci). We continue to believe, however, that association studies have a central role in unraveling the genetic basis of common human diseases, provided that population structure is handled appropriately. One published method for dealing with population structure is Genomic Control (GC; ref. 2). Our paper showed that GC typically performs well but that there are some previously unrecognized problems in certain settings.

We are delighted that our work prompted Devlin and his colleagues to correct this aspect of GC. Their new procedure, GCF, represents an important advance and should be used in place of the original method. We also agree that this approach to handling uncertainty in the estimation of the correction factor λ is better than the use of confidence limits (ref. 3).

But whether the settings in which GC had problems should be dismissed as 'extreme' is less clear. Of course the design and analysis of studies should attempt to control for stratification, but this is not simple to do in practice. First, there are important unresolved empirical questions about the levels and nature of such structure in population groups (e.g., people of European descent in a particular country or African Americans) and unresolved statistical issues about how best to use this kind of information in study design and analysis. Second, in the real world many studies will not meet these worthy objectives, in some cases because relevant confounding factors are not known or not easily measured and in other cases because investigators apportion their limited resources in other directions. Finally, as our paper noted (ref. 1), even with the best design and analysis, there is likely to be a level of residual structure after allowing for known confounders. At present there are limited relevant data to determine the probable levels of residual structure, but the simulations in our paper deliberately included plausible scenarios for these. Notably, in their original paper (ref. 2), Devlin and Roeder described the level of population structure that we considered in ref. 1 (F = 0.01 in their notation) as “realistic”. Further, as noted in ref. 2, cryptic relatedness poses as much of a threat to association studies as does geographic population structure and is much more difficult to reduce by experimental design. Preliminary analysis of a large UK case-control study (886 cases, 878 controls, 8,000 markers) showed substantial inflation of χ² statistics even after accounting for broad geographical region, with a portion of this inflation plausibly due to population structure (D. Clayton, personal communication).

Although we are positive in general about Bayesian statistical methods, we urge caution against viewing the Bayesian mixture approach (GCB), and more generally false discovery rates (refs. 4–6), as a simple panacea for multiple-testing issues. Free lunches are rare. The idea of GCB is to partition loci into two groups, those associated with the disease (outlier loci) and those not associated with the disease, using a sensible statistical model and method to assign loci to each group. Informally, this will be easy if the test statistics of outlier loci look very different from those of nonassociated loci, which would be the case if the genetic effects were large and if there were moderate numbers of loci in each category. On the other hand, for the small effects appropriate to complex diseases, and for genome scans with massive numbers of nonassociated loci and a relatively small number of true disease loci, the tail of the null distribution (after GC) of test statistics may well overlap, or possibly even bury, the few values from associated loci, and no statistical procedure will reliably separate the two. These kinds of settings have not been extensively explored.
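The overlap argument can be illustrated with a small simulation. The following is only a sketch under assumptions that are ours for illustration and are not taken from any of the papers discussed: null statistics are taken to be exactly χ² with 1 d.f. (that is, GC has already worked perfectly), associated loci have a normal score shifted by a hypothetical `effect`, and the function names, locus counts and effect sizes are all invented. It compares an 'easy' scan (large effects, moderate numbers in each group) with a 'hard' one (small effects, very many null loci, few true loci) by asking what fraction of the top-ranked statistics come from truly associated loci.

```python
import random


def simulate_scan(n_null, n_assoc, effect, rng):
    """Return (statistic, is_associated) pairs for one simulated scan.

    Null loci get z ~ N(0, 1); associated loci get z ~ N(effect, 1).
    The test statistic is z squared (1 d.f.). All values are illustrative.
    """
    scan = [(rng.gauss(0.0, 1.0) ** 2, False) for _ in range(n_null)]
    scan += [(rng.gauss(effect, 1.0) ** 2, True) for _ in range(n_assoc)]
    return scan


def true_fraction_in_top(scan, k):
    """Fraction of the k largest statistics that come from associated loci."""
    top = sorted(scan, key=lambda pair: pair[0], reverse=True)[:k]
    return sum(1 for _, associated in top if associated) / k


rng = random.Random(1)  # fixed seed so the run is reproducible

# Easy setting: large effects, moderate numbers of loci in each category.
easy = true_fraction_in_top(simulate_scan(1_000, 100, 6.0, rng), 100)

# Hard setting: small effects, many null loci, few associated loci.
hard = true_fraction_in_top(simulate_scan(100_000, 10, 2.0, rng), 10)

print(f"easy: {easy:.2f}, hard: {hard:.2f}")
```

In the easy setting the top of the ranking consists almost entirely of associated loci, whereas in the hard setting it is dominated by null loci, whatever rule is then used to split the two groups. The sketch says nothing about how GCB itself behaves; it only shows why the raw statistics carry little information for separating the groups in the second kind of setting.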

We conclude with two points of detail. It is false that our original paper (ref. 1) assumed “subjects that originate from different populations”. Much of our focus (e.g., Fig. 4c–e and Fig. 6 in ref. 1) deliberately (and explicitly) concerned structure plausible within current populations. Finally, there are two different ways in which GC (or GCB or GCF) could fail in practice: (i) the null distribution of the test statistic may not behave as a simple multiple of a χ² distribution, or (ii) the statistical allowance for the inflation factor may not be effective. The 'short cut' simulations given by Devlin et al. above presuppose that the first point is not a problem. In the absence of a formal mathematical proof, and with abundant computing resources, it would seem better to check routinely both aspects of GC, as in their Table 1, rather than only the second, as in their Figures 1 and 2.
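A check of the first aspect might look like the sketch below. It is a toy under stated assumptions, not the simulation design of either group: λ is estimated by a median-based rule (median of the statistics divided by the median of a χ² distribution with 1 d.f.), which we assume here for illustration; the 'mixed' null, in which half of the loci are inflated by a factor of 1 and half by a factor of 3, is an artificial construction chosen only because it is not a simple multiple of χ²; and all function and variable names are hypothetical. The sketch rescales the statistics by the estimated λ and compares the observed tail frequency beyond the nominal 1% point with the frequency expected under χ² with 1 d.f.

```python
import math
import random
import statistics

CHI2_1_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 d.f.


def estimate_lambda(stats):
    """Median-based inflation factor (an assumed estimator, for illustration)."""
    return statistics.median(stats) / CHI2_1_MEDIAN


def chi2_1_tail(x):
    """P(X > x) for X distributed as chi-square with 1 d.f."""
    return math.erfc(math.sqrt(x / 2.0))


def tail_ratio(stats, threshold=6.635):
    """Observed / expected tail frequency after dividing by estimated lambda.

    threshold=6.635 is the nominal 1% point of chi-square with 1 d.f.
    A ratio near 1 is what a simple multiple of chi-square would give.
    """
    lam = estimate_lambda(stats)
    observed = sum(1 for s in stats if s / lam > threshold) / len(stats)
    return observed / chi2_1_tail(threshold)


rng = random.Random(7)  # fixed seed so the run is reproducible
n = 100_000

# Null that IS a simple multiple of chi-square: every locus inflated by 1.5.
simple = [1.5 * rng.gauss(0.0, 1.0) ** 2 for _ in range(n)]

# Artificial null that is NOT a simple multiple: inflation alternates 3 and 1.
mixed = [(1.0 if i % 2 else 3.0) * rng.gauss(0.0, 1.0) ** 2 for i in range(n)]

lam_simple = estimate_lambda(simple)
ratio_simple = tail_ratio(simple)
ratio_mixed = tail_ratio(mixed)
print(f"lambda (simple): {lam_simple:.2f}")
print(f"tail ratio, simple: {ratio_simple:.2f}; mixed: {ratio_mixed:.2f}")
```

For the simple multiple, the estimated λ is close to the true value and the tail ratio is close to 1. For the mixed null, a single rescaling leaves clearly too many statistics beyond the nominal threshold, which is the kind of departure that a simulation assuming a simple multiple of χ² from the outset cannot reveal.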