The reliability and reproducibility of science are under scrutiny. However, a major cause of this lack of repeatability is not being considered: the wide sample-to-sample variability in the P value. We explain why P is fickle to discourage the ill-informed practice of interpreting analyses based predominantly on this statistic.
Reproducible research findings are a cornerstone of the scientific method, providing essential validation. There has been recent recognition, however, that the results of published research can be difficult to replicate1,2,3,4,5,6,7, an awareness epitomized by a series in Nature entitled “Challenges in irreproducible research” and by the Reproducibility Initiative, a project intended to identify and reward reproducible research (http://validation.scienceexchange.com/#/reproducibilityinitiative). In a recent meeting at the American Association for the Advancement of Science headquarters involving many of the major journals reporting biomedical science research, a common set of principles and guidelines was agreed upon for promoting transparency and reproducibility8. These discussions and initiatives all focused on a number of issues, including aspects of statistical reporting9, levels of statistical power (i.e., sufficient statistical capacity to find an effect; a 'statistically significant' finding)10 and inclusion-exclusion criteria. Yet a fundamental problem inherent in standard statistical methods, one that is pervasively linked to the lack of reproducibility in research, remains to be considered: the wide sample-to-sample variability in the P value. This omission reflects a general lack of awareness about this crucial issue, and we address this matter here.
Focusing on the P value during statistical analysis is an entrenched culture11,12,13. The P value is often used without the realization that in most cases the statistical power of a study is too low for P to assist the interpretation of the data (Box 1). Among the many and varied reasons for a fearful and hidebound approach to statistical practice, a lack of understanding is prominent14. A better understanding of why P is so unhelpful should encourage scientists to reduce their reliance on this misleading concept.
Readers may know of the long-standing philosophical debate about the value and validity of null-hypothesis testing15,16,17. Although the P value formalizes null-hypothesis testing, this article will not revisit these issues. Rather, we concentrate on how P values themselves are misunderstood.
Although statistical power is a central element in reliability18, it is often considered only when a test fails to demonstrate a real effect (such as a difference between groups): a 'false negative' result (see Box 2 for a glossary of statistical terms used in this article). Many scientists who are not statisticians do not realize that the power of a test is equally relevant when considering statistically significant results, that is, when the null hypothesis appears to be untenable. This is because the statistical power of the test dramatically affects our capacity to interpret the P value and thus the test result. It may surprise many scientists to discover that interpreting a study result from its P value alone is spurious in all but the most highly powered designs. The reason for this is that unless statistical power is very high, the P value exhibits wide sample-to-sample variability and thus does not reliably indicate the strength of evidence against the null hypothesis (Box 1).
We give a step-by-step, illustrated explanation of how statistical power affects the reliability of the P value obtained from an experiment, with reference to previous Points of Significance articles published in Nature Methods, to help convey these issues. We suggest that, for this reason, the P value's preeminence16 is unjustified and arguments about null-hypothesis tests become virtually irrelevant. Researchers would do better to discard the P value and use alternative statistical measures for data interpretation.
The misunderstanding about P
Ronald Fisher developed significance testing to make judgments about hypotheses19, arguing that the lower the P value, the greater the reason to doubt the null hypothesis20. He suggested using the P value as a continuous variable to aid judgment. Today, scientific articles are typically peppered with P values, and often treat P as a dichotomous variable, slavishly focusing on a threshold value of 0.05. Such focus is unfounded because, for instance, P = 0.06 should be considered essentially the same as P = 0.04; P values should not be given an aura of exactitude21,22. However, using P as a graded measure of evidence against the null hypothesis, as Fisher proposed, highlights the even more fundamental misunderstanding about P. If statistical power is limited, regardless of whether the P value returned from a statistical test is low or high, a repeat of the same experiment will likely result in a substantially different P value17 and thus suggest a very different level of evidence against the null hypothesis. Therefore, the P value gives little information about the probable result of a replication of the experiment; it has low test-retest reliability. Put simply, the P value is usually a poor test of the null hypothesis. Most researchers recognize that a small sample is less likely to satisfactorily reflect the population that they wish to study, as has been described in the Points of Significance series21, but they often do not realize that this effect will influence P values. There is variability in the P value23, but this is rarely mentioned in statistics textbooks or in statistics courses.
Indeed, most scientists employ the P value as if it were an absolute index of the truth. A low P value is automatically taken as substantial evidence that the data support a real phenomenon. In turn, researchers then assume that a repeat experiment would probably also return a low P value and support the original finding's validity. Thus, many studies reporting a low P value are never challenged or replicated. These single studies stand alone and are taken to be true. In fact, another similar study with new, different, random observations from the populations would result in different samples and thus could well return a P value that is substantially different, possibly providing much less apparent evidence for the reported finding.
Why statistical power is rarely sufficient for us to trust P
P values are only as reliable as the sample from which they have been calculated. A small sample taken from a population is unlikely to reliably reflect the features of that population21. As the number of observations taken from the population increases (i.e., sample size increases), the sample gives a better representation of the population from which it is drawn because it is less subject to the vagaries of chance. In the same way, values derived from these samples also become more reliable, and this includes the P value. Unfortunately, even when statistical power is close to 90%, a P value cannot be considered to be stable; the P value would vary markedly each time if a study were replicated. In this sense, P is unreliable. As an example, if a study obtains P = 0.03, there is a 90% chance that a replicate study would return a P value somewhere in the wide range of 0–0.6 (90% prediction interval), whereas the chance of P < 0.05 is just 56% (ref. 24). In other words, the spread of possible P values from replicate experiments may be considerable and will usually range widely across the typical threshold for significance of 0.05. This may surprise many who believe that a test with 80% power is robust; however, this view comes from the accepted risk of a false negative.
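The 56% figure can be approximated with a short calculation. The sketch below is not necessarily the method used in ref. 24; it assumes a z-test and treats the observed effect as the best guess of the true effect, so that the z statistic of an identical replicate is normally distributed around the observed z with a standard deviation of √2. The function name `prob_replicate_significant` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def prob_replicate_significant(p_obs: float, alpha: float = 0.05) -> float:
    """Approximate chance that an identical replicate gives two-tailed P < alpha.

    Assumption: z-test, with replicate z ~ Normal(z_obs, sqrt(2)), because the
    original and the replicate each carry one unit of sampling variance.
    """
    z_obs = STD_NORMAL.inv_cdf(1 - p_obs / 2)    # z matching the observed P
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    replicate_z = NormalDist(mu=z_obs, sigma=sqrt(2))
    same_direction = 1 - replicate_z.cdf(z_crit)
    opposite_direction = replicate_z.cdf(-z_crit)  # tiny, kept for completeness
    return same_direction + opposite_direction


print(f"{prob_replicate_significant(0.03):.2f}")  # prints 0.56
```

Under these assumptions, an initial P = 0.03 leaves little better than an even chance that a replicate will also fall below 0.05.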
To illustrate the variability of P values and why this happens, we will compare observations drawn from each of two normally distributed populations of data, A and B (Fig. 1). We know that a difference of 0.5 exists between the population means (the true effect size), but this difference may be concealed by the scatter of values within the population. We compare these populations by taking two random samples, one from A and the other from B. If we had to conserve resources, which could be necessary in practical situations, we might limit our two samples to ten observations each. In practice, we would conduct only one experiment, but let us consider the situation of having conducted four such simulated experiments (Fig. 2). For each experiment, we use standard statistics, such as the mean, to estimate features of the population from which the sample was drawn. In addition, and of more relevance, we can estimate the difference between the means (estimated effect size) and also calculate the P value for a two-tailed test. For the four repeated experiments, both the effect size and the P value vary, sometimes substantially, between the replicates (Fig. 2). This is because these small samples are affected by random variation (known as sampling variability). To improve the reliability of the estimated effect size, we can reduce the effects of random variation, and thus increase the power of the comparison, if we take larger samples (Fig. 3). However, although increasing statistical power improves the reliability of P, we find that the P value remains highly variable for all but the very highest values of power.
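A simulation along these lines can be sketched in a few lines of Python. This is an illustration, not the code behind the figures: it assumes both populations have a standard deviation of 1, uses a pooled-variance Student's t-test, and obtains the two-tailed P value by numerically integrating the t density (the helper `t_two_tailed_p` is hypothetical and stands in for a statistics library).

```python
import math
import random
from statistics import mean, variance


def t_two_tailed_p(t: float, df: int, steps: int = 2000) -> float:
    """Two-tailed P for Student's t, via Simpson integration of the t density."""
    const = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2))
    const /= math.sqrt(df * math.pi)

    def pdf(x: float) -> float:
        return const * (1 + x * x / df) ** (-(df + 1) / 2)

    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3  # P(0 < T < |t|)
    return max(0.0, 1 - 2 * area)


def one_experiment(rng: random.Random, n: int = 10,
                   true_diff: float = 0.5, sd: float = 1.0):
    """Draw one sample from A and one from B; return (estimated difference, P)."""
    a = [rng.gauss(0.0, sd) for _ in range(n)]
    b = [rng.gauss(true_diff, sd) for _ in range(n)]
    diff = mean(b) - mean(a)
    pooled_var = (variance(a) + variance(b)) / 2  # valid because group sizes are equal
    t = diff / math.sqrt(pooled_var * 2 / n)
    return diff, t_two_tailed_p(t, df=2 * n - 2)


rng = random.Random(1)  # fixed seed so that the run is repeatable
for i in range(4):
    diff, p = one_experiment(rng)
    print(f"experiment {i + 1}: estimated difference = {diff:+.2f}, P = {p:.3f}")
```

Changing the seed or raising `n` shows how both the estimated difference and P move from one replicate to the next.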
Taking larger samples increases the chance of detecting a particular effect size (such as the difference between the populations), i.e., the frequency with which we find P < 0.05 (Fig. 4). Increasing sample size increases statistical power, and thus a progressively greater proportion of P values < 0.05 are obtained. However, we still face substantial variation in the magnitude of the P values returned. Although studies are often planned to have (an estimated) 80% power, when statistical power is indeed 80%, we still obtain a bewildering range of P values (Fig. 4). Thus, as Figure 4 shows, there will be substantial variation in the P value of repeated experiments. In reality, experiments are rarely repeated; we do not know how different the next P might be. But it could well be very different. For example, regardless of the statistical power of an experiment, if a single replicate returns a P value of 0.05, there is an 80% chance that a repeat experiment would return a P value between 0 and 0.44 (and a 20% chance that P would be even larger). Thus, and as the simulation in Figure 4 clearly shows, even with a highly powered study, we are wrong to claim that the P value reliably shows the degree of evidence against the null hypothesis. Only when the statistical power is at least 90% is a repeat experiment likely to return a similar P value, such that interpretation of P for a single experiment is reliable. In such cases, the effect is so clear that statistical inference is probably not necessary25.
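Both points can be explored numerically. The sketch below makes simplifying assumptions that are ours rather than the article's: it uses a z-test (known standard deviation), so each replicate's z statistic is drawn from a normal distribution centred on the value that gives the requested power, and it derives the upper limit for a replicate's P from the same normal approximation. The function names are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, quantiles

STD_NORMAL = NormalDist()


def two_tailed_p(z: float) -> float:
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def simulate_p_values(power: float, n_experiments: int = 20000,
                      alpha: float = 0.05, seed: int = 0) -> list:
    """P values from replicate z-tests whose true power is `power`."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    # Expected z giving the requested power (wrong-direction tail ignored).
    expected_z = z_crit + STD_NORMAL.inv_cdf(power)
    rng = random.Random(seed)
    return [two_tailed_p(rng.gauss(expected_z, 1.0)) for _ in range(n_experiments)]


def upper_p_limit(p_obs: float, coverage: float = 0.80) -> float:
    """Upper end of a one-sided interval for a replicate's two-tailed P.

    Assumption: replicate z ~ Normal(z_obs, sqrt(2)); meaningful while the
    lower z limit stays positive.
    """
    z_obs = STD_NORMAL.inv_cdf(1 - p_obs / 2)
    z_low = z_obs - STD_NORMAL.inv_cdf(coverage) * sqrt(2)
    return two_tailed_p(z_low)


for power in (0.5, 0.8, 0.9):
    p_values = simulate_p_values(power)
    deciles = quantiles(p_values, n=10)  # nine cut points: 10th ... 90th percentile
    share = sum(p < 0.05 for p in p_values) / len(p_values)
    print(f"power {power:.0%}: share of P < 0.05 = {share:.2f}, "
          f"10th to 90th percentile of P = {deciles[0]:.5f} to {deciles[-1]:.3f}")

print(f"{upper_p_limit(0.05):.2f}")  # prints 0.44
```

Because the interval depends only on the observed P, the 0 to 0.44 range holds whatever the power of the original experiment.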
Most readers will probably appreciate that a large P value associated with 80% statistical power is poor evidence for lack of an important effect. Fewer understand that unless a small P value is extremely small, it provides poor evidence for the presence of an important effect. Most scientific studies have much less than 80% power, often around 50% in psychological research26 and averaging 21% in neuroscience10. Reporting and interpreting P values under such circumstances is of little or no benefit. Such limited statistical power might seem surprising, but it makes sense when considering that a medium effect size of 0.5 and sample sizes of 30 for each of two conditions provide statistical power of 49%. Weak statistical power results from small sample sizes—which are strongly encouraged in animal studies for ethical reasons but increase variability in the data sample—or from basing studies on previous works that report inflated effect sizes.
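The 49% figure can be checked with a normal approximation to the power of a two-sample comparison. This is a sketch under assumptions: a standardized effect size, equal group sizes and a two-tailed test at α = 0.05. The function name `approx_power` is hypothetical, and for small samples the approximation tends to be slightly optimistic relative to an exact t-test calculation.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(effect_size: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-tailed, two-sample test with equal group sizes."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    expected_z = effect_size * sqrt(n_per_group / 2)  # expected test statistic
    detect_right_direction = 1 - std_normal.cdf(z_crit - expected_z)
    detect_wrong_direction = std_normal.cdf(-z_crit - expected_z)
    return detect_right_direction + detect_wrong_direction


print(f"{approx_power(0.5, 30):.2f}")  # prints 0.49
```

With the same effect size, about 64 observations per group are needed before this approximation exceeds 80% power.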
An additional problem with P: exaggerated effect sizes
Simulations of repeated t-tests also illustrate the tendency of small samples to exaggerate effects. This can be shown by adding an additional dimension to the presentation of the data. It is clear how small samples are less likely to be sufficiently representative of the two tested populations to genuinely reflect the small but real difference between them. Those samples that are less representative may, by chance, result in a low P value (Fig. 4). When a test has low power, a low P value will occur only when the sample drawn is relatively extreme. Drawing such a sample is unlikely, and such extreme values give an exaggerated impression of the difference between the original populations (Fig. 5). This phenomenon, known as the 'winner's curse', has been emphasized by others10. If statistical power is augmented by taking more observations, the estimate of the difference between the populations becomes closer to, and centered on, the theoretical value of the effect size (Fig. 5).
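The exaggeration can be reproduced with a simple simulation. The sketch below is not the simulation behind the figures; it assumes a true difference of 0.5, a known standard deviation of 1 in both populations and a z-test, so the estimated difference can be drawn directly from its sampling distribution. The function name `mean_significant_estimate` is hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, mean


def mean_significant_estimate(n_per_group: int, true_diff: float = 0.5,
                              sd: float = 1.0, n_experiments: int = 20000,
                              alpha: float = 0.05, seed: int = 0):
    """Return (mean estimated difference among significant results, share significant)."""
    rng = random.Random(seed)
    se = sd * sqrt(2 / n_per_group)  # standard error of the mean difference
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    significant = []
    for _ in range(n_experiments):
        estimate = rng.gauss(true_diff, se)  # one experiment's estimated difference
        if abs(estimate) / se > z_crit:      # two-tailed P < alpha
            significant.append(estimate)
    return mean(significant), len(significant) / n_experiments


for n in (10, 30, 100, 200):
    avg, share = mean_significant_estimate(n)
    print(f"n = {n:3d} per group: share significant = {share:.2f}, "
          f"mean significant estimate = {avg:.2f} (true value 0.50)")
```

With ten observations per group, only estimates far above 0.5 can reach significance, so the significant results overstate the effect on average; with 200 per group, nearly every experiment is significant and the average sits close to the true value.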
Alternatives to P
Poor statistical understanding leads to errors in analysis and threatens trust in research. Poorly reproducible studies impede and misdirect the progress of science, may do harm if the findings are applied therapeutically, and may discourage the funding of future research. The P value continues to be held up as the key statistic to report and interpret27,28, but we should now accept that this needs to change. In most cases, by simply accepting a P value, we ignore the scientific tenet of repeatability. We must accept this inconvenient truth about P values23 and seek an alternative approach to statistical inference. The natural desire for a single categorical yes-or-no decision should give way to a more mature process in which evidence is graded using a variety of measures. We may also need to reflect on the vast body of material that has already been published using standard statistical criteria. Previous reliance on P values emphasizes the need to reexamine previous results and replicate them if possible2,4 (http://validation.scienceexchange.com/#/reproducibilityinitiative).
We must consider alternative methods of statistical interpretation that could be used. Several options are available, and although no one approach is perfect15, perhaps the most intuitive and tractable is to report effect size estimates and their precision (95% confidence intervals (95% CIs); see Box 3 for statistical formulae discussed in this article)29,30, aided by graphical presentation31,32,33,34. This approach to statistical interpretation emphasizes the importance and precision of the estimated effect size, which answers the most frequent question that scientists ask: how big is the difference, or how strong is the relationship or association? In other words, although researchers may be conditioned to test null hypotheses (which are usually false34), they really want to find not only the direction of an effect but also its size and the precision of that estimate, so that the importance and relevance of the effect can be judged17,35,36.
Specifically, an effect size gives quantitative information about the magnitude of the relationship studied, and its 95% CIs indicate the uncertainty of that measure by presenting the range within which the true effect size is likely to lie (Fig. 6). To aid interpretation of the effect size, researchers may be well advised to consider what effect size they would deem important in the context of their study before data analysis.
Although effect sizes and their 95% CIs can be used to make threshold-based decisions about statistical significance in the same way that the P value can be applied, they provide more information than the P value17, in a more obvious and intuitive way37. In addition, the effect size and 95% CIs allow findings from several experiments to be combined with meta-analysis to obtain more accurate effect-size estimates, which is often the goal of empirical studies. Effect size can be appreciated most easily in the popular types of statistical analysis where a simple difference between group means is considered. However, even in other circumstances—such as measures of goodness of fit, correlation and proportions—effect sizes and, importantly, their 95% CIs, can also be expressed. Such tests and the software needed for the 95% CIs to be calculated and interpreted are readily available38. In addition, modern statistical methods such as bootstrap techniques and permutation tests have been developed for the analysis of small samples common in scientific studies39.
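As an illustration of reporting an effect size with its 95% CI, the sketch below computes a percentile bootstrap interval for the difference between two group means. The data values are invented for the example, the function name `bootstrap_ci_mean_difference` is hypothetical, and the percentile method is only one of several bootstrap intervals; it is shown here because it needs no distributional formulae.

```python
import random


def bootstrap_ci_mean_difference(a, b, n_resamples: int = 10000,
                                 level: float = 0.95, seed: int = 0):
    """Return (estimated difference, lower limit, upper limit) for mean(b) - mean(a)."""
    def average(values):
        return sum(values) / len(values)

    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        resample_a = rng.choices(a, k=len(a))  # resample with replacement
        resample_b = rng.choices(b, k=len(b))
        diffs.append(average(resample_b) - average(resample_a))
    diffs.sort()
    tail = (1 - level) / 2
    lower = diffs[int(tail * n_resamples)]
    upper = diffs[int((1 - tail) * n_resamples) - 1]
    return average(b) - average(a), lower, upper


# Invented measurements, for illustration only.
group_a = [4.1, 5.0, 4.7, 5.3, 4.4, 4.9, 5.6, 4.2, 5.1, 4.8]
group_b = [5.2, 5.9, 4.9, 6.1, 5.5, 5.0, 6.3, 5.4, 5.8, 5.1]

estimate, lower, upper = bootstrap_ci_mean_difference(group_a, group_b)
print(f"difference = {estimate:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

The output reports both the size of the difference and how precisely it has been estimated, which a P value alone does not convey.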
When interpreting data, many scientists appreciate that an estimate of effect size is relevant only within the context of a specific study. We should take this further and not only include effect sizes and their 95% CIs in analyses but also focus our attention on these values and discount the fickle P value. In turn, power analysis can be replaced with 'planning for precision', which calculates the sample size required for estimating the effect size to reach a defined degree of precision40.
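A minimal sketch of planning for precision follows. It is not necessarily the procedure described in ref. 40; it assumes two equal-sized groups, a known common standard deviation and a normal-approximation 95% CI for the difference between means, and solves for the group size at which the CI half-width reaches a chosen target. The function name `n_per_group_for_precision` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group_for_precision(half_width: float, sd: float = 1.0,
                              level: float = 0.95) -> int:
    """Group size at which the CI for a mean difference has the target half-width.

    Assumption: half-width = z * sd * sqrt(2 / n), i.e., known sd and a
    normal critical value rather than a t critical value.
    """
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return ceil(2 * (z * sd / half_width) ** 2)


print(n_per_group_for_precision(0.5))   # prints 31
print(n_per_group_for_precision(0.25))  # prints 123
```

Halving the target half-width roughly quadruples the required sample size, which makes the cost of a precise estimate explicit at the design stage.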
The P value continues to occupy a prominent place within the conduct of research, and discovering that P is flawed will leave many scientists uneasy. As we have demonstrated, however, unless statistical power is very high (and much higher than in most experiments), the P value should be interpreted tentatively at best. Data analysis and interpretation must incorporate the uncertainty embedded in a P value.
We thank J.W. Huber, C.M. Bishop and P.A. Stephens for helpful comments on drafts of this manuscript.