INTRODUCTION

Despite much initial optimism, genetic association studies have been far from entirely successful. A decade ago, Risch and Merikangas1 argued that, under certain circumstances, association analysis could provide a more statistically powerful framework for dissecting complex diseases than linkage analysis. This argument, together with the avalanche of SNPs that subsequently became available, triggered great enthusiasm for association studies. Numerous association studies have since been performed and published in journals spanning a wide range of interdisciplinary interests. However, the results have been less than impressive: most of these studies seem to have failed one of the benchmarks of scientific success, namely replicability.2 A review article by Hirschhorn et al.3 found that, of more than 600 associations previously reported, 166 were studied at least thrice; of these, only six were consistently replicated. Thus, the reality today seems to be one of skepticism, if not downright pessimism.4–9 What has gone wrong? Should we eventually abandon association studies? More precisely, is a 3.6% replication rate really bad? In this article, we show that the concept of replication has been largely misunderstood and that the genetics community has unintentionally placed overly high replication expectations on initially significant findings (even when these are quite significant). In brief, we contend that a 3.6% replication rate, when viewed in the right statistical light, is really not surprising and should not give rise to disillusionment.

MAPPING BY POPULATION ASSOCIATION

A genetic association exists between a phenotype (e.g., disease) and a specific allele at a genetic marker if the allele occurs more often (than would be expected by chance) in a group of individuals with the disease (cases) compared with a group without the disease (controls). Whereas linkage analysis is concerned with the co-segregation of marker loci with disease within families, association analysis studies the dependence between marker alleles and disease at a population level (see, for example, Hodge10).

For association mapping by linkage disequilibrium to be successful, two conditions are desirable.11 First, the disease allele must have arisen only once in the population so that there is complete linkage disequilibrium between the marker and disease alleles (hence small isolated populations are often preferred). Second, the marker and disease loci must be in very close physical proximity, so that the disequilibrium between the marker and disease alleles is the result of tight linkage (typically the recombination fraction θ ≈ 10⁻⁵). Problems in linkage disequilibrium mapping arise when the disequilibrium in question is instead the result of population history, natural selection, or population stratification. In all these cases, an association between marker allele and disease can be observed, but the disease locus is not necessarily close to the marker locus (in the cases of population history or selection) or is even nonexistent (in the case of population stratification).

Population stratification can be especially deleterious to association studies12,13, and much effort has recently been channeled to further understand and correct for this confounder.14–20 If population stratification were corrected for and other design issues were improved in future association studies, should we expect a sudden leap in successful replication rates? We explain why such an expectation will likely remain unfulfilled in years to come. The main problem lies in a common misunderstanding of the meaning of a replication and of how likely it is to occur. Furthermore, we will show that many of the apparent failures to replicate have, in fact, been “pseudo-failures.”

REPLICATION FALLACY, REPLICATION PROBABILITY, AND P VALUES

We use “replication” to refer to a situation in which a null hypothesis that has been rejected at Time 1 is rejected again, and with the same direction of outcome, on the basis of a new study at Time 2 (Table 1).21

Table 1 The meaning of a replication

If a test of association is rejected with P = 0.01, what is the probability that the study will be replicated? Oakes22 reports that, in a survey of 70 academic psychologists each with at least 2 years' research experience, 60% endorsed the statement that if an initial study is significant with P = 0.01, then “You have a reliable experimental finding in the sense that if, hypothetically, the experiment were repeated a great many times, you would obtain a significant result on 99% of occasions.” Nothing could be further from the truth. In point of fact, as we will soon show, when α = 0.05, an initial study with P = 0.05 implies a replication probability of only 0.5, one with P = 0.02 a replication probability of only approximately 0.64, and one with P = 0.01 a replication probability of no more than approximately 0.73! These results: (1) hold irrespective of the sample size used, as long as the sample sizes in the initial and subsequent studies are equal (the case of unequal sample sizes is discussed later); (2) assume a z test (we discuss deviations from this assumption below); and (3) are based on the concept of power. The incorrect belief that 1 − p is the probability of replication of an initial study is known as the replication fallacy.23,24 The replication fallacy explains why the scientific community generally has overly high expectations that an initial study yielding a relatively low P value should be replicated, and why, when that does not happen, the initial finding is presumed to have been false. Nothing, again, could be further from the truth. The whole issue is succinctly described by Oakes22:

“We have seen that the power of a replication of an independent means t-test design when the first experiment has an associated probability of 0.02 is approximately 0.67 … Suppose then, psychologist B, suspecting that A's results were an artefact … decided to perform an exact replication of A's study. Suppose B's results were in the same direction as A's but were not significant. It would be folly surely for B to assert that A's findings were indeed artefactual.”

It is precisely this kind of apparent replication failure that we describe as a pseudo-failure because the probability of a successful replication is a priori quite low, and a subsequent “failure” should come as neither a surprise nor a contradiction of the initial positive finding.
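As a quick numerical illustration of the figures quoted above, the equal-sample-size replication probability for a two-sided z-type test is Φ(z1 − zcrit), the special case (n2 = n1) of Equation 1 presented below. The following is a minimal sketch in Python; the function name and the use of scipy are our own choices, not part of the original analysis:

```python
from scipy.stats import norm

def replication_power_equal_n(p1, alpha=0.05):
    """Replication probability when the second study has the same sample size
    as the first (the n2 = n1 special case of Equation 1)."""
    z_crit = norm.ppf(1 - alpha / 2)   # critical value of the two-sided test
    z1 = norm.ppf(1 - p1 / 2)          # z value implied by the initial P value
    return norm.cdf(z1 - z_crit)

for p1 in (0.05, 0.02, 0.01, 0.005):
    print(f"P = {p1:<6} -> replication power ~ {replication_power_equal_n(p1):.2f}")
# P = 0.05 -> ~0.50, P = 0.02 -> ~0.64, P = 0.01 -> ~0.73, P = 0.005 -> ~0.80
```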

If the P value is not a direct measure of the replicability of an initial study, is there anything at all it can tell about the likelihood of replication? The short answer is yes. Although neither the exact P value nor its complement can be interpreted as the probability of a successful replication, it has been shown that, for a given test size α, “the replicability of null hypothesis rejection is a continuous, increasing function of the complement of its P value”.25 Although this fact is well known in the social and behavioral sciences22,24–27, it seems to have been overlooked by the genetics community, and its implications have not before been explored as we do in this article.

Let us now consider an initial association study based on a simple χ2 test (Table 2), which yields a P value p1 for a test size α, where p1 ≤ α. It can be shown that the initial finding, assuming it is not a false positive, will be replicated with probability (see Appendix)

Table 2 Contingency table and χ2 test statistics for first and second studies

pREP = Φ(z1 √(n2/n1) − zcrit),    (1)

where n1, n2 are the sample sizes (i.e., the number of cases or controls) in the first and second studies, respectively, zcrit = Φ−1(1 − α/2), z1 = Φ−1(1 − p1/2), and Φ, Φ−1 are the standard normal distribution function and inverse distribution function, respectively. Note that, in accordance with our definition of a replication, Equation 1 will be used only when: (1) the initial test is significant (i.e., p1 ≤ α); and (2) the outcomes of both tests are in the same direction (i.e., a1 − c1 and a2 − c2 in Table 2 have the same signs). Moreover, although Equation 1 assumes an equal number of cases and controls in the first study, and similarly equal numbers in the second study, situations with different numbers of cases and controls can be dealt with through the use of harmonic means. More specifically, any study containing r cases and s controls (where r ≠ s) can be treated as one with t cases and t controls, where t = 2rs/(r + s).28 Equation 1 implicitly assumes the same test size α for both the first and subsequent studies; for unequal test sizes, say α* and α, respectively, the same equation can be used as long as p1 ≤ α*. Finally, we note that the whole issue of replicability can also be approached from a Bayesian perspective.29
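A minimal sketch of this calculation in Python follows; the function names, the scipy dependency, and the illustrative numbers of cases and controls are our own assumptions:

```python
from scipy.stats import norm

def effective_n(cases, controls):
    """Harmonic-mean effective sample size, t = 2rs/(r + s), for a study with
    unequal numbers of cases (r) and controls (s)."""
    return 2.0 * cases * controls / (cases + controls)

def replication_power(p1, n1, n2, alpha=0.05):
    """Equation 1: probability of replicating an initially significant finding
    (two-sided P value p1, effective sample size n1) in a second study of
    effective sample size n2, taking the observed effect size as the truth."""
    if p1 > alpha:
        raise ValueError("Equation 1 applies only when the initial test is significant")
    z_crit = norm.ppf(1 - alpha / 2)
    z1 = norm.ppf(1 - p1 / 2)
    return norm.cdf(z1 * (n2 / n1) ** 0.5 - z_crit)

# Hypothetical example: a first study with 150 cases and 250 controls gives P = 0.01;
# a second study with 300 cases and 300 controls is planned.
n1 = effective_n(150, 250)   # 187.5
n2 = effective_n(300, 300)   # 300.0
print(round(replication_power(0.01, n1, n2), 2))   # ~0.90
```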

The critical assumption made in Equation 1 is that the expected effect size in the second study equals the observed effect size in the initial study (Greenwald et al.25). Therefore, the replication probability calculated is, in effect, the conditional power of the second study to be statistically significant given this assumption about effect size (hence, “replication probability” and “replication power” are used interchangeably from now on). Regarding this assumption, Greenwald et al.25 state that it “involves a step of inductive reasoning that is (a) well recognized to lack rigorous logical foundation but is (b) nevertheless essential to ordinary scientific activity.” Of more pertinence is the fact that, when calculating replication probabilities, the literature has usually assumed the second study is an exact repetition of the first study25,27, in the sense that both studies: (1) have the same sample size; and (2) are performed using the same population. Clearly, this is the norm neither in association studies nor elsewhere, especially the second condition. Regarding the first condition, all replication calculations are based on the assumption of constancy of effect size, as we explained at the start of the paragraph. Now, because effect size is unaffected by sample size24, there is enough justification to use the effect size in a study with a different sample size. However, it is very desirable for the first study to have a reasonably large sample size so that the standard error of its observed effect size is relatively small. For examples on the use of different sample sizes in replication calculations, based on initial effect sizes, see Tversky and Kahneman30 and Heils.31 Regarding the second condition, Tan et al.32 point out, “Estimates of effect size will tend to regress to the true effect size in subsequent [association] studies, which is usually less extreme.” The same feeling has been echoed elsewhere,7,8,33 and Göring et al.5 have provided a theoretical justification. Thus, the expected effect size of the second study will, on average, be smaller than the observed effect size of the first study. What this means is that, in general, even when a different population is used for the subsequent study, Equation 1 can still be used, with the understanding that the replication probabilities calculated from that equation actually represent upper bounds for the true replication probabilities.25

Let us now examine three of the more unexpected consequences of Equation 1, when both the first and second studies have similar sample sizes (i.e., n2 ≈ n1):

Consequence 1

A P value only slightly less than the nominal α in the first study (e.g., P = 0.04 at α = 0.05) results in a replication power of only approximately 50% for the second study (see Fig. 1).

Fig. 1 Variation of replication probability as a function of the P value of the χ2 test in the first study. The first and second studies are assumed to have the same sample size. For the top curve, α = 0.05; for the bottom curve, α = 0.01. For other values of α, use Equation 1.

Consequence 2

Reasonably low P values in the first study do not necessarily result in high replication power of the second study (e.g., P = 0.02 at α = 0.05 implies a replication power of no more than 64%) (see Fig. 1).

Consequence 3

To achieve a replication power of 80% for the second study at α = 0.05, a P value of at most 0.005 must be obtained in the first study (see Fig. 1).

When n1 and n2 are allowed to be different, two more consequences follow:

Consequence 4

For reasonably large sample sizes, the replication power depends only on the ratio of the initial and subsequent sample sizes, not on their absolute values (e.g., if P = 0.02 at α = 0.05, then pREP = 0.727 when n1 = 100 and n2 = 120, and pREP = 0.723 when n1 = 500 and n2 = 600).

Consequence 5

For a given sample size of the first study, if the initial P value is only slightly less than the nominal α, the sample size required for the second study must be much larger to achieve a replication power of 80%. Suppose an initial association study with n1 cases and n1 controls yields a P value p1. Then, the number of cases (or controls) required for the second study to achieve a replication power of pREP follows directly from Equation 1

n2 = n1 [(zcrit + Φ−1(pREP)) / z1]²,

where, as before, zcrit = Φ−1(1 − α/2), z1 = Φ−1(1 − p1/2), Φ−1 is the inverse standard normal distribution function, and we have assumed z1² ≪ 2n1. Figure 2 illustrates the variation of n2/n1 with p1. For example, if the initial study has P = 0.04 at α = 0.05, then the sample size of the second study must be approximately 1.86 times that of the first study for a replication power of 80%. (Because the expected effect size of the second study will, on average, be smaller than the observed effect size of the first study, as we explained earlier, the actual required relative sample size will be greater than 1.86.)
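The same relationship can be evaluated numerically; the following is a minimal sketch, with the function name and sample sizes chosen by us for illustration:

```python
from scipy.stats import norm

def required_n2(p1, n1, target_power=0.80, alpha=0.05):
    """Number of cases (or controls) needed in the second study to reach the
    target replication power, taking the first study's effect size as the truth
    (valid when z1**2 is much smaller than 2 * n1)."""
    z_crit = norm.ppf(1 - alpha / 2)
    z1 = norm.ppf(1 - p1 / 2)
    return n1 * ((z_crit + norm.ppf(target_power)) / z1) ** 2

print(round(required_n2(0.04, 100) / 100, 2))   # ~1.86: second study ~1.86x the first
print(round(required_n2(0.02, 100) / 100, 2))   # ~1.45: a smaller initial P value helps
```

Because only the ratio n2/n1 enters the calculation at this order of approximation, the near scale-invariance noted in Consequence 4 also follows directly.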

Fig. 2 Variation of the relative sample size (number of cases or controls) required for the second study, as a function of the P value of the χ2 test in the first study, to achieve a replication probability of 80%.

IMPLICATION FOR ASSOCIATION STUDIES AND DISCUSSION

Initial association studies that achieve only borderline significance, and even those that yield relatively low P values, imply a low replication probability (or power) for a subsequent study. Therefore, failure to replicate such an initial finding in a subsequent study should neither come as a surprise nor be deemed “troubling”; nor is this failure to replicate necessarily an outright refutation of the initial finding.

What should one do with an initial study that yields a P value of, say, 0.03? We leave it to the reader to decide, but consider this: if roughly the same sample size is used for both the initial and subsequent studies, the replication power is only about 58%, and no clinical trial with 58% power would ever pass a review board. If one insists on replicating, one should do so with the understanding that there is a substantial chance of not succeeding. There is no denying that consistent replication and subsequent biological confirmation should be the gold standard. However, replication of some initial findings just might not be within the realm of high probability.

We have, however, seen that an association study that yields a P value of 0.005 at α = 0.05 implies that a subsequent study will have almost 80% probability of replication (assuming the expected effect size of the second study is the observed effect size of the first). Does this mean that any time one obtains an initial P value of 0.005, one should expect 80% of future studies to indeed confirm the initial finding? To answer this question, one must remember that the replication power in Equation 1 is calculated under the assumption that the effect size observed in the initial study is the population effect size in the subsequent study. If this is indeed the case, one would obtain an 80% replication rate in future findings. (Because the original effect size will more likely be an overestimate, the actual success rate will be slightly lower than 80%, as explained previously.) However, if the initial finding is a false positive, which for a truly null association occurs at a rate of α = 5%, the replication rate will be much lower than 80%.29
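To make this concrete, here is a small illustrative calculation; the 30% share of false positives below is a purely hypothetical figure of our own, not an estimate from this article. The replication rate observed across many follow-up studies is a mixture of the conditional power for the true positives and roughly α/2 (significant and in the same direction) for the false positives:

```python
from scipy.stats import norm

alpha = 0.05
p1 = 0.005
power_if_true = norm.cdf(norm.ppf(1 - p1 / 2) - norm.ppf(1 - alpha / 2))  # ~0.80

frac_false = 0.30   # hypothetical fraction of initial positives that are false
# A false positive "replicates" (significant, same direction) with probability ~ alpha / 2
expected_rate = (1 - frac_false) * power_if_true + frac_false * alpha / 2
print(round(expected_rate, 2))   # ~0.57, well below the nominal 80%
```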

This point leads us to a vital consideration before Equation 1 can be applied. It is extremely important for the association study to be well designed so that sources of bias either are minimal or have been removed. Although this will not completely eliminate false positives, it will keep them as low as possible. Otherwise, it will be less likely for the expected effect size in the subsequent study to be even close to the observed effect size of the first study. For example, suppose an investigator reports a P value of 0.005, but the design used suffers from considerable population stratification. Then, this significant finding will more likely be a false positive, the observed effect size will very likely be a gross overestimate of the actual effect size, and application of Equation 1 will lead to exaggerated confidence in the replication power of a subsequent study (i.e., the true replication power will be much lower than 80%).

A critical issue for genome screen association studies concerns corrections for multiple testing. The question then arises as to how to compute the replication power after observing one or more positive results in a genome screen. Which test size α should one use, and which P value? The answer may vary depending on the situation. The simplest case is one in which a single association peak from a genome scan is chosen for replication, i.e., a locus at which a few apparently highly associated SNPs are tested for replication. It would then be misleading to use the test with the smallest P value, while still assuming a test size α, to calculate the replication power. Because a genome-wide association study could potentially consist of an extremely large number of tests, an FDR correction procedure34 is appealing. The appropriate P values to use in the computation of replication power are then those that are significant under the FDR procedure. In the special case when only a few tests, say C, are performed and a Bonferroni adjustment of the test size α is chosen, a P value that is significant at α/C could be used to compute the replication power for the subsequent test.
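The following is a minimal sketch of this workflow; the P values are invented for illustration, the Benjamini–Hochberg step is coded by hand, and we assume (our choice, not stated in the article) that each selected SNP is followed up with a single test at α = 0.05:

```python
from scipy.stats import norm

def bh_significant(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure; returns the indices of rejected tests."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    last_reject = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / m:
            last_reject = rank
    return set(order[:last_reject])

def replication_power(p1, n1, n2, alpha=0.05):
    """Equation 1, as above."""
    z_crit = norm.ppf(1 - alpha / 2)
    z1 = norm.ppf(1 - p1 / 2)
    return norm.cdf(z1 * (n2 / n1) ** 0.5 - z_crit)

# Hypothetical P values from a multi-SNP scan
pvals = [0.0004, 0.003, 0.008, 0.04, 0.2, 0.6]
for i in sorted(bh_significant(pvals, q=0.05)):
    print(pvals[i], round(replication_power(pvals[i], n1=500, n2=500), 2))
# 0.0004 -> ~0.94, 0.003 -> ~0.84, 0.008 -> ~0.76; the SNP with P = 0.04 is not carried forward
```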

In this article, we focused on the replication power for a second study based on the P value of only one first study. This is mainly because, in many cases, only one or two replications have been attempted, as can be seen, for example, from the survey conducted by Hirschhorn et al.3 However, it is also legitimate to ask how Equation 1 could be used if we wished to compute the replication power based on the results of several initial studies, i.e., based on a meta-analytic approach.21,35 In this case, we propose computing the average of the effect sizes (θ̄1) of the initial studies (see Appendix) and estimating the initial effective sample size (nh) from the harmonic mean of the initial sample sizes. The value of z1 can then be calculated from θ̄1 and nh (see Appendix), and Equation 1 can be used to calculate the replication power of the second study, with n1 = nh. However, there are three points to note with such an approach: (1) all initial studies (whether significant or not) must be used in the calculation of z1; (2) the value of z1 must be at least as large as zcrit; and (3) sources of bias, e.g., population stratification, in the initial studies must be corrected before θ̄1 is calculated; otherwise the latter will be overestimated.

How robust is Equation 1 to distributional and sample size assumptions? It is important for the sample sizes of the second and especially the first studies to be reasonably large, for at least two reasons: (1) so that χ2 tests can be legitimately used; and (2) so that the effect size in the first experiment can be consistently estimated. When testing two proportions using small sample sizes, replication power can still be calculated from a noncentral hypergeometric distribution, but the result will not be accurate, because the effect size estimated from a small first study is itself unreliable. When comparing two means using the t-test, formulas for the replication power have been given by Greenwald et al.25 and Posavac.27 Finally, Consequence 1 (i.e., pREP ≈ 0.5 when an initial study has a P value only slightly less than α) always holds irrespective of the underlying distribution, as long as the latter is symmetric.
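For the t-test case, the formulas of Greenwald et al.25 and Posavac27 are not reproduced here; the sketch below is our own rough version of such a conditional power calculation using the noncentral t distribution, under the same assumption that the first study's observed effect carries over to the second study:

```python
import numpy as np
from scipy import stats

def t_replication_power(t1, n1_per_group, n2_per_group, alpha=0.05):
    """Conditional power of a two-sample t test in a second study, taking the
    first study's observed standardized effect (Cohen's d) as the true effect."""
    d_hat = t1 * np.sqrt(2.0 / n1_per_group)      # effect size observed in study 1
    df2 = 2 * n2_per_group - 2
    ncp = d_hat * np.sqrt(n2_per_group / 2.0)     # noncentrality parameter in study 2
    t_crit = stats.t.ppf(1 - alpha / 2, df2)
    return stats.nct.sf(t_crit, df2, ncp)         # significant and in the same direction

# Initial two-sided P = 0.02 with 20 subjects per group, replicated with the same n
t1 = stats.t.ppf(1 - 0.01, 2 * 20 - 2)            # t value corresponding to P = 0.02
print(round(t_replication_power(t1, 20, 20), 2))  # ~0.66, in line with Oakes' "approximately 0.67"
```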

The rationale for assuming that the expected effect size in a second study equals the effect size observed in the first study, which is at the basis of Equation 1, can legitimately be questioned. Referring more specifically to the odds ratio, Fleiss et al.36 point out that, whereas differences in proportions would vary between studies, measures of effect size could remain constant from study to study. Murphy and Myors37 further state that the effect size “provides a simple metric that allows for comparison of treatment effects from different studies, areas of research.” In fact, as we mentioned previously, it is more justifiable to assume that the expected effect size in the second study is, on average, slightly less than the observed effect size in the initial study. Consequently, Equation 1 really gives the replication power for the best-case scenario, so that the actual replication power of the second study will be slightly less than that given by Equation 1.

Despite the various deficiencies of the P value, which have been discussed at great length elsewhere,26,38–40 we believe that, in addition to measures of effect size and confidence intervals, researchers should continue to report P values “because tests of statistical significance provide information that effect sizes and confidence intervals do not”.27 We advocate, as do Greenwald et al.,25 the reporting of exact P values, rather than expressions such as P < 0.01 or P > 0.05. Whenever a subsequent study is planned, researchers should compute its replication power based on the initial P value and on the sample sizes of the initial and subsequent studies.

The literature on genetic association studies is rife with admonitions and possible explanations for non-replication. The reasons usually cited are one or more of the following6,8,9: (1) population stratification; (2) genetic heterogeneity; (3) inflation of Type I error; and (4) gene-environment interaction. We do not deny that these are important problems, and attempts should be made to correct for them. But even if these problems were remedied, attempts to replicate many initial findings, even quite significant ones, may be predisposed to failure, and such failures should not be interpreted as necessarily contradicting the initial association. To our knowledge, only one publication in the genetic-epidemiology literature5 has acknowledged this fact. Moreover, only one publication31 on association studies has actually reported a calculation of the power of a replication (based on the initial observed effect size), and even that power calculation seems to be slightly inflated.

When we examined five of the six studies surveyed by Hirschhorn et al.,3 for which P values could be obtained and which had been consistently replicated, we found that all of them had P < 10⁻³. Thus, subsequent studies of these with similar sample sizes had approximately 99.8% replication power! We also randomly sampled 50 of the 166 initial studies that had reported exact P values. We found that: (1) 78% had P values > 0.005, implying that a subsequent study of similar sample size would be underpowered (a replication power of <80%); and (2) 38% had P values greater than or equal to 0.02, implying that a subsequent study of similar sample size would be seriously underpowered (a replication power of <64%). Although it is true that many of the replication failures reported by Hirschhorn et al.3 could be the result of the several reasons mentioned by these authors (e.g., population stratification, heterogeneity), it is equally true that, even in the remaining cases in which these sources of error were minimal, replication failures were bound to occur because of their low a priori replication power. Furthermore, even if sources of bias had been minimal in most cases, the replication success rate would still have been low. On a second look, therefore, Hirschhorn et al.'s3 review should perhaps not look so depressing, because it contains many potential pseudo-failures, that is, replications that were just not meant to be. We must stress, however, that our article in no way attempts to challenge Hirschhorn et al.'s arguments. It acknowledges these, but it also adds a whole new perspective to the issue of replication: old problems still need to be grappled with, because addressing them reduces the rate of false positives, but even if they are addressed, low replication power will continue to deny high replication rates in future association studies.

In conclusion, if the P value in one's initial study is very small (e.g., P = 0.005 when α = 0.05), then one can indeed anticipate a high replication probability. However, for more commonly observed P values (e.g., P = 0.02 and even P = 0.01, when α = 0.05), replication probabilities are notably lower than one might hope. In these situations, one should not be surprised by a failure to replicate.