Box 1. Significance levels in genome-wide studies

From the following article:

Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls

The Wellcome Trust Case Control Consortium

Nature 447, 661-678(7 June 2007)



There has been much debate concerning interpretation of significance levels in genome-wide association studies and whether, and how, these should be corrected for multiple testing. Classical multiple testing theory in statistics is concerned with the problem of 'multiple tests' of a single 'global' null hypothesis. This, we would argue, is a problem far removed from that which faces us in genome-wide association studies, where we face the problem of testing 'multiple hypotheses' (for a particular disease, one hypothesis for each SNP, or region of correlated SNPs, in the genome) and we thus do not subscribe to the view that one should correct significance levels for the number of tests performed to obtain 'genome-wide significance levels'. Nonetheless, our aim is to keep the false positive rate within acceptable bounds and this still leads to the view that very low P values are needed for strong evidence of association. But the factor determining the threshold is not the number of tests performed, but the a priori probability that there is likely to be a true association at any specified location in the genome. Of course, we cannot know this prior probability from objective evidence, but we can perhaps estimate an order of magnitude.

There are two linked questions. The first concerns the choice of an appropriate 'threshold' for reporting possible associations as likely to be genuine. Here the mathematics is quite straightforward if we make the simplifying assumption that we have the same power to detect all true associations. Then we have18

Posterior odds for true association = Prior odds times Power/Significance threshold

That is, for a given significance threshold, the probability of a true association depends on the prior odds and, crucially, the power. A plausible estimate for the prior odds of true association at any specified locus might be of the order of 100,000:1 against, for example, on the basis of 1,000,000 'independent' regions of the genome and an expectation of 10 detectable genes involved in the condition. (Other plausible estimates might vary from this by an order of magnitude or so in either direction.) Then, assuming a power of 0.5 and a significance threshold of 5 times 10-7, the posterior odds in favour of a 'hit' being a true association would be 10:1. However, if we relax this significance threshold by a factor of ten, or alternatively if the power were lower by a factor of 10, the posterior odds that a 'hit' is a true association would also be reduced by a factor of ten. This simple mathematical analysis is little affected by allowing for the fact that true associations come in various sizes with varying power to detect them; the above formula is simply modified by interpreting 'power' as the mean power.

The above discussion concerns 'average' properties of 'hits' achieving given significance levels. After the association data are available, a related but different question is whether a particular positive finding is likely to be a true one. For that calculation, the prior odds must be multiplied by the Bayes factor, the ratio of the probability of the observed data under the assumption that there is a true association to its probability under the null hypothesis. As in power calculations, the calculation of Bayes factors requires assumptions about effect sizes (see Methods for details).

A key point from both perspectives is that interpreting the strength of evidence in an association study depends on the likely number of true associations, and the power to detect them which, in turn, depends on effect sizes and sample size. In a less-well-powered study it would be necessary to adopt more stringent thresholds to control the false-positive rate. Thus, when comparing two studies for a particular disease, with a hit with the same MAF and P value for association, the likelihood that this is a true positive will in general be greater for the study that is better powered, typically the larger study. In practice, smaller studies often employ less stringent P-value thresholds, which is precisely the opposite of what should occur.