The fickle P value generates irreproducible results

Halsey, Lewis G; Curran-Everett, Douglas; Vowler, Sarah L; Drummond, Gordon B

doi:10.1038/nmeth.3288

Commentary
Published: 26 February 2015

Nature Methods volume 12, pages 179–185 (2015)

The reliability and reproducibility of science are under scrutiny. However, a major cause of this lack of repeatability is not being considered: the wide sample-to-sample variability in the P value. We explain why P is fickle to discourage the ill-informed practice of interpreting analyses based predominantly on this statistic.

Reproducible research findings are a cornerstone of the scientific method, providing essential validation. There has been recent recognition, however, that the results of published research can be difficult to replicate¹,²,³,⁴,⁵,⁶,⁷, an awareness epitomized by a series in Nature entitled “Challenges in irreproducible research” and by the Reproducibility Initiative, a project intended to identify and reward reproducible research (http://validation.scienceexchange.com/#/reproducibilityinitiative). In a recent meeting at the American Association for the Advancement of Science headquarters involving many of the major journals reporting biomedical science research, a common set of principles and guidelines was agreed upon for promoting transparency and reproducibility⁸. These discussions and initiatives all focused on a number of issues, including aspects of statistical reporting⁹, levels of statistical power (i.e., sufficient statistical capacity to detect an effect as a 'statistically significant' finding)¹⁰ and inclusion-exclusion criteria. Yet a fundamental problem inherent in standard statistical methods, one that is pervasively linked to the lack of reproducibility in research, remains to be considered: the wide sample-to-sample variability in the P value. This omission reflects a general lack of awareness about this crucial issue, and we address this matter here.

Focusing on the P value during statistical analysis is an entrenched culture¹¹,¹²,¹³. The P value is often used without the realization that in most cases the statistical power of a study is too low for P to assist the interpretation of the data (Box 1). Among the many and varied reasons for a fearful and hidebound approach to statistical practice, a lack of understanding is prominent¹⁴. A better understanding of why P is so unhelpful should encourage scientists to reduce their reliance on this misleading concept.

Readers may know of the long-standing philosophical debate about the value and validity of null-hypothesis testing¹⁵,¹⁶,¹⁷. Although the P value formalizes null-hypothesis testing, this article will not revisit these issues. Rather, we concentrate on how P values themselves are misunderstood.

Although statistical power is a central element in reliability¹⁸, it is often considered only when a test fails to demonstrate a real effect (such as a difference between groups): a 'false negative' result (see Box 2 for a glossary of statistical terms used in this article). Many scientists who are not statisticians do not realize that the power of a test is equally relevant when considering statistically significant results, that is, when the null hypothesis appears to be untenable. This is because the statistical power of the test dramatically affects our capacity to interpret the P value and thus the test result. It may surprise many scientists to discover that interpreting a study result from its P value alone is spurious in all but the most highly powered designs. The reason for this is that unless statistical power is very high, the P value exhibits wide sample-to-sample variability and thus does not reliably indicate the strength of evidence against the null hypothesis (Box 1).

We give a step-by-step, illustrated explanation of how statistical power affects the reliability of the P value obtained from an experiment, with reference to previous Points of Significance articles published in Nature Methods, to help convey these issues. We suggest that, for this reason, the P value's preeminence¹⁶ is unjustified and arguments about null-hypothesis tests become virtually irrelevant. Researchers would do better to discard the P value and use alternative statistical measures for data interpretation.

The misunderstanding about P

Ronald Fisher developed significance testing to make judgments about hypotheses¹⁹, arguing that the lower the P value, the greater the reason to doubt the null hypothesis²⁰. He suggested using the P value as a continuous variable to aid judgment. Today, scientific articles are typically peppered with P values, and often treat P as a dichotomous variable, slavishly focusing on a threshold value of 0.05. Such focus is unfounded because, for instance, P = 0.06 should be considered essentially the same as P = 0.04; P values should not be given an aura of exactitude²¹,²². However, using P as a graded measure of evidence against the null hypothesis, as Fisher proposed, highlights the even more fundamental misunderstanding about P. If statistical power is limited, regardless of whether the P value returned from a statistical test is low or high, a repeat of the same experiment will likely result in a substantially different P value¹⁷ and thus suggest a very different level of evidence against the null hypothesis. Therefore, the P value gives little information about the probable result of a replication of the experiment; it has low test-retest reliability. Put simply, the P value is usually a poor test of the null hypothesis. Most researchers recognize that a small sample is less likely to satisfactorily reflect the population that they wish to study, as has been described in the Points of Significance series²¹, but they often do not realize that this effect will influence P values. There is variability in the P value²³, but this is rarely mentioned in statistics textbooks or in statistics courses.

Indeed, most scientists employ the P value as if it were an absolute index of the truth. A low P value is automatically taken as substantial evidence that the data support a real phenomenon. In turn, researchers then assume that a repeat experiment would probably also return a low P value and support the original finding's validity. Thus, many studies reporting a low P value are never challenged or replicated. These single studies stand alone and are taken to be true. In fact, another similar study with new, different, random observations from the populations would result in different samples and thus could well return a P value that is substantially different, possibly providing much less apparent evidence for the reported finding.

Why statistical power is rarely sufficient for us to trust P

P values are only as reliable as the sample from which they have been calculated. A small sample taken from a population is unlikely to reliably reflect the features of that population²¹. As the number of observations taken from the population increases (i.e., sample size increases), the sample gives a better representation of the population from which it is drawn because it is less subject to the vagaries of chance. In the same way, values derived from these samples also become more reliable, and this includes the P value. Unfortunately, even when statistical power is close to 90%, a P value cannot be considered to be stable; the P value would vary markedly each time if a study were replicated. In this sense, P is unreliable. As an example, if a study obtains P = 0.03, there is a 90% chance that a replicate study would return a P value anywhere in the wide range of 0–0.6 (the 90% prediction interval), whereas the chance of P < 0.05 is just 56% (ref. 24). In other words, the spread of possible P values from replicate experiments may be considerable and will usually range widely across the typical threshold for significance of 0.05. This may surprise many who believe that a test with 80% power is robust; however, this view comes from the accepted risk of a false negative.

To illustrate the variability of P values and why this happens, we will compare observations drawn from each of two normally distributed populations of data, A and B (Fig. 1). We know that a difference of 0.5 exists between the population means (the true effect size), but this difference may be concealed by the scatter of values within the population. We compare these populations by taking two random samples, one from A and the other from B. If we had to conserve resources, which could be necessary in practical situations, we might limit our two samples to ten observations each. In practice, we would conduct only one experiment, but let us consider the situation of having conducted four such simulated experiments (Fig. 2). For each experiment, we use standard statistics, such as the mean, to estimate features of the population from which the sample was drawn. In addition, and of more relevance, we can estimate the difference between the means (estimated effect size) and also calculate the P value for a two-tailed test. For the four repeated experiments, both the effect size and the P value vary, sometimes substantially, between the replicates (Fig. 2). This is because these small samples are affected by random variation (known as sampling variability). To improve the reliability of the estimated effect size, we can reduce the effects of random variation, and thus increase the power of the comparison, if we take more samples (Fig. 3). However, although increasing statistical power improves the reliability of P, we find that the P value remains highly variable for all but the very highest values of power.
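The four-experiment comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the simulation used for the figures. It assumes both populations have a standard deviation of 1 (so the true difference of 0.5 is also the standardized effect size) and, to stay within the standard library, it swaps the t-test for a z-test that treats that standard deviation as known. The function name `simulate_experiment` and the seed are hypothetical choices.

```python
import random
from statistics import NormalDist, mean

def simulate_experiment(rng, n, true_diff=0.5, sd=1.0):
    """Draw n observations from A ~ N(0, sd) and n from B ~ N(true_diff, sd);
    return the estimated difference in means and a two-tailed P value."""
    a = [rng.gauss(0.0, sd) for _ in range(n)]
    b = [rng.gauss(true_diff, sd) for _ in range(n)]
    diff = mean(b) - mean(a)
    # Assumption: z-test with known sd; a t-test would estimate sd from the samples.
    z = diff / (sd * (2.0 / n) ** 0.5)
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return diff, p

rng = random.Random(1)  # fixed seed so the run is repeatable
results = [simulate_experiment(rng, n=10) for _ in range(4)]
for i, (diff, p) in enumerate(results, start=1):
    print(f"experiment {i}: estimated difference = {diff:+.2f}, P = {p:.3f}")
```

Changing the seed gives a different set of four experiments, which is the point: with ten observations per group, both the estimated difference and P move around considerably from one run to the next.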

**Figure 1: Simulated data distributions of two populations.**

**Figure 2: Small samples show substantial variation.**

**Figure 3: A larger sample size estimates effect size more precisely.**

Taking larger samples increases the chance of detecting a particular effect size (such as the difference between the populations), i.e., the frequency with which we find P < 0.05 (Fig. 4). Increasing sample size increases statistical power, and thus a progressively greater proportion of P values < 0.05 are obtained. However, we still face substantial variation in the magnitude of the P values returned. Although studies are often planned to have (an estimated) 80% power, when statistical power is indeed 80%, we still obtain a bewildering range of P values (Fig. 4). Thus, as Figure 4 shows, there will be substantial variation in the P value of repeated experiments. In reality, experiments are rarely repeated; we do not know how different the next P might be. But it could well be very different. For example, regardless of the statistical power of an experiment, if a single replicate returns a P value of 0.05, there is an 80% chance that a repeat experiment would return a P value between 0 and 0.44 (and a 20% chance that P would be even larger). Thus, and as the simulation in Figure 4 clearly shows, even with a highly powered study, we are wrong to claim that the P value reliably shows the degree of evidence against the null hypothesis. Only when the statistical power is at least 90% is a repeat experiment likely to return a similar P value, such that interpretation of P for a single experiment is reliable. In such cases, the effect is so clear that statistical inference is probably not necessary²⁵.
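A rough way to see this spread is to repeat the simulated experiment many times at several sample sizes and summarize the resulting P values. The sketch below is an illustration under assumptions, not a reproduction of Figure 4: the populations are assumed to have a standard deviation of 1, a z-test with known standard deviation stands in for the t-test, and the sample sizes, replicate count and function name `replicate_p_values` are hypothetical choices. Under these assumptions, about 64 observations per group corresponds to roughly 80% power.

```python
import random
from statistics import NormalDist, quantiles

def replicate_p_values(n, replicates=2000, true_diff=0.5, sd=1.0, seed=0):
    """Return two-tailed P values from repeated simulated experiments, each
    comparing n observations from A ~ N(0, sd) with n from B ~ N(true_diff, sd)."""
    rng = random.Random(seed)
    norm = NormalDist()
    se = sd * (2.0 / n) ** 0.5  # standard error of the difference (sd assumed known)
    p_values = []
    for _ in range(replicates):
        mean_a = sum(rng.gauss(0.0, sd) for _ in range(n)) / n
        mean_b = sum(rng.gauss(true_diff, sd) for _ in range(n)) / n
        z = (mean_b - mean_a) / se
        p_values.append(2.0 * (1.0 - norm.cdf(abs(z))))
    return p_values

summary = {}
for n in (10, 30, 64, 100):
    ps = replicate_p_values(n)
    deciles = quantiles(ps, n=10)  # nine cut points: 10th, 20th, ..., 90th percentiles
    summary[n] = {
        "fraction_significant": sum(p < 0.05 for p in ps) / len(ps),
        "p10": deciles[0],
        "p90": deciles[-1],
    }
    print(n, summary[n])
```

The fraction of replicates with P < 0.05 is an empirical estimate of power, whereas the gap between the 10th and 90th percentiles shows how widely P itself varies across replicates of the same design.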

**Figure 4: Sample size affects the distribution of P values.**

Most readers will probably appreciate that a large P value associated with 80% statistical power is poor evidence for lack of an important effect. Fewer understand that unless a small P value is extremely small, it provides poor evidence for the presence of an important effect. Most scientific studies have much less than 80% power, often around 50% in psychological research²⁶ and averaging 21% in neuroscience¹⁰. Reporting and interpreting P values under such circumstances is of little or no benefit. Such limited statistical power might seem surprising, but it makes sense when considering that a medium effect size of 0.5 and sample sizes of 30 for each of two conditions provide statistical power of 49%. Weak statistical power results from small sample sizes—which are strongly encouraged in animal studies for ethical reasons but increase variability in the data sample—or from basing studies on previous works that report inflated effect sizes.
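The power figure quoted for an effect size of 0.5 with 30 observations per group can be checked with a short calculation. The sketch below uses a normal approximation for a two-tailed comparison at a significance threshold of 0.05; whether the quoted value was obtained this way is an assumption, and an exact calculation based on the noncentral t distribution gives a slightly different number. The function name `approx_power` is hypothetical.

```python
from statistics import NormalDist

def approx_power(d, n_per_group, alpha=0.05):
    """Normal-approximation power of a two-tailed, two-sample comparison
    with standardized effect size d and n_per_group observations per group."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1.0 - alpha / 2.0)
    shift = d * (n_per_group / 2.0) ** 0.5  # expected value of the test statistic
    # Probability that the statistic lands beyond either critical value.
    return (1.0 - norm.cdf(z_crit - shift)) + norm.cdf(-z_crit - shift)

print(f"power for d = 0.5, n = 30 per group: {approx_power(0.5, 30):.2f}")
```

Under this approximation the result is about 0.49. Setting `d` to 0 returns the significance threshold itself, which is a quick sanity check on the formula.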

An additional problem with P: exaggerated effect sizes

Simulations of repeated t-tests also illustrate the tendency of small samples to exaggerate effects. This can be shown by adding an additional dimension to the presentation of the data. It is clear how small samples are less likely to be sufficiently representative of the two tested populations to genuinely reflect the small but real difference between them. Those samples that are less representative may, by chance, result in a low P value (Fig. 4). When a test has low power, a low P value will occur only when the sample drawn is relatively extreme. Drawing such a sample is unlikely, and such extreme values give an exaggerated impression of the difference between the original populations (Fig. 5). This phenomenon, known as the 'winner's curse', has been emphasized by others¹⁰. If statistical power is augmented by taking more observations, the estimate of the difference between the populations becomes closer to, and centered on, the theoretical value of the effect size (Fig. 5).
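The exaggeration can be made concrete by comparing the average estimated difference across all simulated experiments with the average across only those that reach P < 0.05. The sketch below is illustrative rather than a reproduction of Figure 5. It assumes a true difference of 0.5, a population standard deviation of 1 and a z-test with known standard deviation in place of the t-test; `estimated_differences` and the replicate count are hypothetical.

```python
import random
from statistics import NormalDist, mean

def estimated_differences(n, replicates=3000, true_diff=0.5, sd=1.0, seed=0):
    """Simulate replicate experiments; return (all estimates, estimates with P < 0.05)."""
    rng = random.Random(seed)
    norm = NormalDist()
    se = sd * (2.0 / n) ** 0.5  # standard error of the difference (sd assumed known)
    all_estimates, significant = [], []
    for _ in range(replicates):
        mean_a = sum(rng.gauss(0.0, sd) for _ in range(n)) / n
        mean_b = sum(rng.gauss(true_diff, sd) for _ in range(n)) / n
        diff = mean_b - mean_a
        p = 2.0 * (1.0 - norm.cdf(abs(diff / se)))
        all_estimates.append(diff)
        if p < 0.05:
            significant.append(diff)
    return all_estimates, significant

for n in (10, 64):
    every, sig = estimated_differences(n)
    print(f"n = {n}: mean of all estimates = {mean(every):.2f}, "
          f"mean of 'significant' estimates = {mean(sig):.2f}")
```

With ten observations per group, an estimate must exceed roughly 0.88 in magnitude to reach P < 0.05 under these assumptions, so every 'significant' estimate overstates the true difference of 0.5. With 64 per group the threshold falls below the true difference and the selected estimates sit much closer to it.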

**Figure 5: How sample size alters estimated effect size.**

Alternatives to P

Poor statistical understanding leads to errors in analysis and threatens trust in research. Poorly reproducible studies impede and misdirect the progress of science, may do harm if the findings are applied therapeutically, and may discourage the funding of future research. The P value continues to be held up as the key statistic to report and interpret²⁷,²⁸, but we should now accept that this needs to change. In most cases, by simply accepting a P value, we ignore the scientific tenet of repeatability. We must accept this inconvenient truth about P values²³ and seek an alternative approach to statistical inference. The natural desire for a single categorical yes-or-no decision should give way to a more mature process in which evidence is graded using a variety of measures. We may also need to reflect on the vast body of material that has already been published using standard statistical criteria. Previous reliance on P values emphasizes the need to reexamine previous results and replicate them if possible²,⁴ (http://validation.scienceexchange.com/#/reproducibilityinitiative).

We must consider alternative methods of statistical interpretation that could be used. Several options are available, and although no one approach is perfect¹⁵, perhaps the most intuitive and tractable is to report effect size estimates and their precision (95% confidence intervals (95% CIs); see Box 3 for statistical formulae discussed in this article)²⁹,³⁰, aided by graphical presentation³¹,³²,³³,³⁴. This approach to statistical interpretation emphasizes the importance and precision of the estimated effect size, which answers the most frequent question that scientists ask: how big is the difference, or how strong is the relationship or association? In other words, although researchers may be conditioned to test null hypotheses (which are usually false³⁴), they really want to find not only the direction of an effect but also its size and the precision of that estimate, so that the importance and relevance of the effect can be judged¹⁷,³⁵,³⁶.

Specifically, an effect size gives quantitative information about the magnitude of the relationship studied, and its 95% CIs indicate the uncertainty of that measure by presenting the range within which the true effect size is likely to lie (Fig. 6). To aid interpretation of the effect size, researchers may be well advised to consider what effect size they would deem important in the context of their study before data analysis.
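As a hedged illustration of what the 95% CI conveys, the sketch below computes an interval for the difference between two sample means in repeated simulated experiments and counts how often the interval contains the true difference. It is a simplification: the population standard deviation is assumed known (1 here) and a normal critical value replaces the t value, so the interval has a fixed width for a given sample size. The true difference of 0.5, the sample sizes and the name `difference_with_ci` are assumptions made for the example.

```python
import random
from statistics import NormalDist

Z_95 = NormalDist().inv_cdf(0.975)  # two-tailed 95% critical value, about 1.96

def difference_with_ci(rng, n, true_diff=0.5, sd=1.0):
    """Return (estimate, lower, upper) for the difference between two sample means."""
    mean_a = sum(rng.gauss(0.0, sd) for _ in range(n)) / n
    mean_b = sum(rng.gauss(true_diff, sd) for _ in range(n)) / n
    diff = mean_b - mean_a
    half_width = Z_95 * sd * (2.0 / n) ** 0.5  # known-sd simplification
    return diff, diff - half_width, diff + half_width

rng = random.Random(2)
for n in (10, 64):
    intervals = [difference_with_ci(rng, n) for _ in range(2000)]
    coverage = sum(lo <= 0.5 <= hi for _, lo, hi in intervals) / len(intervals)
    est, lo, hi = intervals[0]
    print(f"n = {n}: first estimate {est:.2f} (95% CI {lo:.2f} to {hi:.2f}), "
          f"coverage over replicates = {coverage:.3f}")
```

Coverage stays near 95% at both sample sizes, but the interval is much narrower with 64 observations per group than with 10. That width is the precision information that a P value alone does not show.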

**Figure 6: Characterizing the precision of effect size using the 95% CI of the difference between the means.**

Although effect sizes and their 95% CIs can be used to make threshold-based decisions about statistical significance in the same way that the P value can be applied, they provide more information than the P value¹⁷, in a more obvious and intuitive way³⁷. In addition, the effect size and 95% CIs allow findings from several experiments to be combined with meta-analysis to obtain more accurate effect-size estimates, which is often the goal of empirical studies. Effect size can be appreciated most easily in the popular types of statistical analysis where a simple difference between group means is considered. However, even in other circumstances—such as measures of goodness of fit, correlation and proportions—effect sizes and, importantly, their 95% CIs, can also be expressed. Such tests and the software needed for the 95% CIs to be calculated and interpreted are readily available³⁸. In addition, modern statistical methods such as bootstrap techniques and permutation tests have been developed for the analysis of small samples common in scientific studies³⁹.
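For small samples, one of the resampling approaches mentioned above can be sketched as a percentile bootstrap interval for the difference between two group means. This is only one of several bootstrap variants and is not necessarily the procedure the cited sources recommend. The data values below are made up for illustration, and the function name `bootstrap_ci` and its defaults are hypothetical.

```python
import random
from statistics import mean

def bootstrap_ci(a, b, n_boot=5000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for mean(b) - mean(a)."""
    rng = random.Random(seed)
    # Resample each group with replacement and recompute the difference each time.
    diffs = sorted(
        mean(rng.choices(b, k=len(b))) - mean(rng.choices(a, k=len(a)))
        for _ in range(n_boot)
    )
    lower = diffs[round((1.0 - level) / 2.0 * n_boot)]
    upper = diffs[round((1.0 + level) / 2.0 * n_boot) - 1]
    return lower, upper

# Hypothetical measurements for two groups of ten observations each.
group_a = [4.1, 5.0, 4.6, 5.3, 4.8, 4.4, 5.1, 4.9, 4.7, 5.2]
group_b = [5.2, 5.6, 4.9, 5.9, 5.4, 5.0, 5.8, 5.3, 5.5, 5.1]

low, high = bootstrap_ci(group_a, group_b)
print(f"difference = {mean(group_b) - mean(group_a):.2f}, "
      f"95% bootstrap CI {low:.2f} to {high:.2f}")
```

The interval is read in the same way as the formula-based CI: it reports the estimated effect together with a range that reflects its uncertainty.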

When interpreting data, many scientists appreciate that an estimate of effect size is relevant only within the context of a specific study. We should take this further and not only include effect sizes and their 95% CIs in analyses but also focus our attention on these values and discount the fickle P value. In turn, power analysis can be replaced with 'planning for precision', which calculates the sample size required for estimating the effect size to reach a defined degree of precision⁴⁰.
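A simple version of planning for precision asks how many observations per group are needed for the 95% CI of a difference between two means to have a chosen half-width. The sketch below is an approximation under stated assumptions: a known common standard deviation, equal group sizes and a normal critical value. The methods in the cited literature are more refined, and the function name `n_per_group_for_precision` and the example half-widths are hypothetical.

```python
import math
from statistics import NormalDist

def n_per_group_for_precision(half_width, sd=1.0, level=0.95):
    """Smallest n per group such that z * sd * sqrt(2 / n) <= half_width."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return math.ceil(2.0 * (z * sd / half_width) ** 2)

for w in (0.5, 0.25, 0.1):
    print(f"half-width {w}: about {n_per_group_for_precision(w)} observations per group")
```

Because the half-width shrinks with the square root of the sample size, halving it requires roughly four times as many observations.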

The P value continues to occupy a prominent place within the conduct of research, and discovering that P is flawed will leave many scientists uneasy. As we have demonstrated, however, unless statistical power is very high (and much higher than in most experiments), the P value should be interpreted tentatively at best. Data analysis and interpretation must incorporate the uncertainty embedded in a P value.

Box 1: Power analysis and repeatability

A reasonable definition of the P value is that it measures the strength of evidence against the null hypothesis. However, unless statistical power is very high (>90%), the P value does not do this reliably. Power analysis combined with an either-or interpretation of the P value (simply either 'statistically significant' or 'statistically nonsignificant') allows us to estimate how often, if we were to conduct many replicate tests, a 'statistically significant result' will be found (assuming no type II errors)¹⁸. For instance, if the null hypothesis is false and a study has a power of 80%, then out of 100 replicates, about 80 of them will be deemed statistically significant. In this sense, statistical power quantifies the repeatability of the P value, but only in terms of the either-or interpretation. Furthermore, in the real world, the power of a study is not known; at best it can be estimated. Finally, this interpretation of P is flawed because the strength of evidence against the null hypothesis is a continuous function of the magnitude of P (ref. 41).

Box 2: Glossary

95% confidence intervals (95% CIs). The range of values around a sample statistic (typically the mean) that will in theory encompass the population statistic for roughly 95% of all samples drawn.

Effect size. A measure, sometimes normalized, of the magnitude of an observed effect. An effect measured in a sample is an estimate of the true (population) effect size. Interpretation of the P value is usually based on the assumption that the true effect size is 0.

False negative. See “Type II error.”

Normal distribution. Also called the Gaussian distribution; a frequency distribution that can be mathematically defined (see equations in Box 3) and that is assumed to be common empirically.

Null hypothesis. The backbone of a substantial number of statistical tests. The observer assumes that there is no difference between the samples and thus that they could have been drawn from the same population. The statistical test estimates the likelihood that the observed values, or more extreme values, would have been obtained if the null hypothesis were true.

P value. Two reasonable definitions are (i) the strength of evidence in the data against the null hypothesis and (ii) the long-run frequency of getting the same result or one more extreme if the null hypothesis is true.

Population. A very large group that a researcher wishes to characterize with measures such as the mean and the spread of the data but that is too vast to be collected exhaustively such that an exact measure of the population cannot be obtained.

(Random) sample. Measures taken randomly from a defined population of interest, which are used to provide an estimate of the characteristics of the population. The bigger the sample size, the more accurate the characterization of the population.

Replicate. A repeat procedure using a new sample from the appropriate population(s).

Sample size. The number of measures (observations) in the sample.

Standard deviation. An estimate of the mean variability (spread) of a sample.

Statistical power. A measure of the capacity of an experiment to find an effect (a 'statistically significant result') when there truly is an effect. This depends on several features of the experiment: the threshold for significance, size of the expected effect, variation present in the population, alternative hypothesis (one or two sided), nature of the test (paired or unpaired) and sample size. Power involves considering both the size of the effect that is deemed important and the background variation of the measure that is being taken, analogously to a signal-to-noise ratio. In most cases, the influence of natural variation can be reduced by increasing the sample size. With a greater sample size, the measure can be assessed more reliably because the features of the sampled population can be gauged more accurately.

(Threshold for) significance. The value at or below which P is interpreted as 'statistically significant'; this should be used only if the Neyman-Pearson approach to null-hypothesis testing is employed⁴².

Type II error (or 'false negative'). Incorrectly concluding that there is no effect in the population when there truly is an effect. (A type I error is the incorrect conclusion that there is an effect in the population when there truly is no effect.)

Box 3: Symbols and equations

Population parameters

μ Population mean

σ Population standard deviation

σ² Population variance

σ_ȳ Standard deviation of the sampling distribution of the sample mean

Sample statistics

y_i Sample observation i, where i = 1, 2, ..., n

s Sample standard deviation

s² Sample variance

n Number of observations

v Degrees of freedom

ȳ Sample mean

P Achieved significance level

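Mean of a sample

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$

where y_i is the ith observation (subject or value).

Sample variance

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})^2$$

Mean of a population

$$\mu = \frac{1}{N}\sum_{i=1}^{N} y_i$$

where N is the number of values in the population.

Confidence interval

For the difference between means of two independent populations:

$$(\bar{y}_1 - \bar{y}_2) \pm t_{\alpha/2}\, s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}, \qquad s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$

where t_α/2 is the critical two-tailed value in the t-distribution for n₁ + n₂ − 2 degrees of freedom. There is a probability of 1 − α that this interval will contain the true difference between the population means.

Normal distribution

$$f(y) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(y - \mu)^2}{2\sigma^2}\right)$$

describes the distribution of values in a normal population.

t-statistic for two independent samples

For samples with an equal number of subjects (n) in each group and the null hypothesis H₀: μ₁ = μ₂:

$$t = \frac{\bar{y}_1 - \bar{y}_2}{\sqrt{(s_1^2 + s_2^2)/n}}$$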

References

1. Woolston, C. Nature 513, 283 (2014).
2. Mobley, A., Linder, S.K., Braeuer, R., Ellis, L.M. & Zwelling, L. PLoS ONE 8, e63221 (2013).
3. Anonymous. Economist 26–30 (19 October 2013).
4. Russell, J.F. Nature 496, 7 (2013).
5. Bohannon, J. Science 344, 788–789 (2014).
6. Van Noorden, R. Nature News doi:10.1038/nature.2014.15509 (2014).
7. Anonymous. Nature 515, 7 (2014).
8. McNutt, M. Science 346, 679 (2014).
9. Vaux, D.L. Nature 492, 180–181 (2012).
10. Button, K.S. et al. Nat. Rev. Neurosci. 14, 365–376 (2013).
11. Nuzzo, R. Nature 506, 150–152 (2014).
12. Fidler, F., Burgman, M.A., Cumming, G., Buttrose, R. & Thomason, N. Conserv. Biol. 20, 1539–1544 (2006).
13. Tressoldi, P.E., Giofré, D., Sella, F. & Cumming, G. PLoS ONE 8, e56180 (2013).
14. Sharpe, D. Psychol. Methods 18, 572–582 (2013).
15. Ellison, A.M., Gotelli, N.J., Inouye, B.D. & Strong, D.R. Ecology 95, 609–610 (2014).
16. Murtaugh, P.A. Ecology 95, 611–617 (2014).
17. Cohen, J. Am. Psychol. 49, 997–1003 (1994).
18. Krzywinski, M. & Altman, N. Nat. Methods 10, 1139–1140 (2013).
19. Fisher, R.A. Statistical Methods for Research Workers (Oliver and Boyd, 1925).
20. Fisher, R.A. Statistical Methods and Scientific Inference 2nd edn. (Hafner, 1959).
21. Krzywinski, M. & Altman, N. Nat. Methods 10, 809–810 (2013).
22. McCormack, J., Vandermeer, B. & Allan, G.M. BMC Med. Res. Methodol. 13, 134 (2013).
23. Boos, D.D. & Stefanski, L.A. Am. Stat. 65, 213–221 (2011).
24. Cumming, G. Perspect. Psychol. Sci. 3, 286–300 (2008).
25. Cumming, G. Psychol. Sci. 25, 7–29 (2014).
26. Maxwell, S.E. Psychol. Methods 9, 147–163 (2004).
27. Salsburg, D.S. Am. Stat. 39, 220–223 (1985).
28. Johnson, V.E. Proc. Natl. Acad. Sci. USA 110, 19313–19317 (2013).
29. Johnson, D.H. J. Wildl. Mgmt. 63, 763–772 (1999).
30. Krzywinski, M. & Altman, N. Nat. Methods 10, 921–922 (2013).
31. Masson, M.E. & Loftus, G.R. Can. J. Exp. Psychol. 57, 203–220 (2003).
32. Drummond, G.B. & Vowler, S.L. J. Physiol. (Lond.) 589, 1861–1863 (2011).
33. Lavine, M. Ecology 95, 642–645 (2014).
34. Loftus, G.R. Behav. Res. Methods Instrum. Comput. 25, 250–256 (1993).
35. Martínez-Abraín, A. Acta Oecol. 34, 9–11 (2008).
36. Nakagawa, S. & Cuthill, I.C. Biol. Rev. Camb. Philos. Soc. 82, 591–605 (2007).
37. Curran-Everett, D. Adv. Physiol. Educ. 33, 87–90 (2009).
38. Grissom, R.J. & Kim, J.J. Effect Sizes for Research: Univariate and Multivariate Applications 2nd edn. (Routledge, 2011).
39. Fearon, P. Psychologist 16, 632–635 (2003).
40. Maxwell, S.E., Kelley, K. & Rausch, J.R. Annu. Rev. Psychol. 59, 537–563 (2008).
41. Rosnow, R.L. & Rosenthal, R. Am. Psychol. 44, 1276–1284 (1989).
42. Lew, M.J. Br. J. Pharmacol. 166, 1559–1567 (2012).

Acknowledgements

We thank J.W. Huber, C.M. Bishop and P.A. Stephens for helpful comments on drafts of this manuscript.

Author information

Authors and Affiliations

Lewis G. Halsey is in the Department of Life Sciences, University of Roehampton, London, UK.
Douglas Curran-Everett is in the Division of Biostatistics and Bioinformatics, and the Department of Biostatistics and Informatics, National Jewish Health, Colorado School of Public Health, University of Colorado Denver, Denver, Colorado, USA.
Sarah L. Vowler is at Cancer Research UK Cambridge Institute, University of Cambridge, Cambridge, UK.
Gordon B. Drummond is at the University of Edinburgh, Edinburgh, UK.

Corresponding author

Correspondence to Lewis G Halsey.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

About this article

Cite this article

Halsey, L., Curran-Everett, D., Vowler, S. et al. The fickle P value generates irreproducible results. Nat Methods 12, 179–185 (2015). https://doi.org/10.1038/nmeth.3288

Issue Date: March 2015