The fickle P value generates irreproducible results

Nature Methods 12, 179–185 (2015)
doi:10.1038/nmeth.3288

The reliability and reproducibility of science are under scrutiny. However, a major cause of this lack of repeatability is not being considered: the wide sample-to-sample variability in the P value. We explain why P is fickle to discourage the ill-informed practice of interpreting analyses based predominantly on this statistic.

Figures

  1. Figure 1: Simulated data distributions of two populations.

    The difference between the mean values is 0.5, which is the true (population) effect size. The standard deviation (the spread of values) of each population is 1.

  2. Figure 2: Small samples show substantial variation.

    We drew samples of ten values at random from each of the populations A and B from Figure 1 to give four simulated comparisons. Horizontal lines denote the mean. We give the estimated effect size (the difference in the means) and the P value when the sample pairs are compared.
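The sampling procedure this caption describes can be sketched in a few lines of Python. The sketch below is illustrative only: it draws ten values per group from normal populations matching Figure 1 and, as a simplification, computes a two-sided P from a normal approximation to the two-sample t statistic rather than the exact t distribution the article's t-tests use.

```python
import random
import statistics
from statistics import NormalDist

def approx_p(a, b):
    # Two-sided P from a normal approximation to the two-sample t
    # statistic (a simplification; the article uses a proper t-test).
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    z = abs(statistics.mean(a) - statistics.mean(b)) / se
    return 2.0 * (1.0 - NormalDist().cdf(z))

random.seed(2)
# Four comparisons, as in Figure 2: n = 10 per group, true effect 0.5.
results = []
for _ in range(4):
    a = [random.gauss(0.0, 1.0) for _ in range(10)]
    b = [random.gauss(0.5, 1.0) for _ in range(10)]
    effect = statistics.mean(b) - statistics.mean(a)
    results.append((round(effect, 2), round(approx_p(a, b), 3)))
print(results)  # effect-size estimates and P values scatter from sample to sample
```

Re-running with different seeds reproduces the figure's point: identical experiments on the same two populations return noticeably different effect estimates and P values.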

  3. Figure 3: A larger sample size estimates effect size more precisely.

    We drew random samples of the indicated sizes from each of the two simulated populations in Figure 1 and made 1,000 simulated comparisons for each sample size. We assessed the precision of the effect size from each comparison using the 95% CI range. The histograms show the distributions of these 95% CI ranges for different sample sizes. As sample size increased, both the range and scatter of the confidence intervals decreased, reflecting increased power and greater precision from larger sample sizes. The vertical scale of each histogram has been adjusted so that the height of each plot is the same.
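A minimal sketch of this simulation, under the same assumptions as before (normal populations from Figure 1; the normal quantile 1.96 standing in for the t quantile, so the CI widths are approximate):

```python
import random
import statistics

def ci_range(n, z=1.96):
    # Width of an approximate 95% CI for the difference in means;
    # the normal quantile 1.96 is used in place of the t quantile.
    a = [random.gauss(0.0, 1.0) for _ in range(n)]
    b = [random.gauss(0.5, 1.0) for _ in range(n)]
    se = (statistics.variance(a) / n + statistics.variance(b) / n) ** 0.5
    return 2.0 * z * se

random.seed(3)
mean_width = {}
for n in (10, 30, 100):
    widths = [ci_range(n) for _ in range(1000)]
    mean_width[n] = statistics.mean(widths)
    print(n, round(mean_width[n], 2), round(statistics.stdev(widths), 2))
# Both the typical CI width and its scatter shrink as the sample grows.
```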

  4. Figure 4: Sample size affects the distribution of P values.

    We drew random samples of the indicated sizes from each of the two simulated populations in Figure 1 and made 1,000 simulated comparisons with a two-sample t-test for each sample size. The distribution of P values is shown; it varies substantially depending on the sample size. Above each histogram we show the number of P values at or below 0.001, 0.01, 0.05 (red) and 1. The empirical power is the percentage of simulations in which the true difference of 0.5 is detected using a cutoff of P < 0.05. These broadly agree with the theoretical power.
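The empirical-power calculation described here can be reproduced with a short simulation. This is a sketch under stated assumptions, not the article's exact code: it again substitutes a normal approximation for the two-sample t-test, which slightly inflates power at small n.

```python
import random
import statistics
from statistics import NormalDist

def approx_p(a, b):
    # Two-sided P from a normal approximation to the two-sample t
    # statistic (an approximation to the t-test used in the article).
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    z = abs(statistics.mean(a) - statistics.mean(b)) / se
    return 2.0 * (1.0 - NormalDist().cdf(z))

def empirical_power(n, sims=1000, alpha=0.05):
    # Fraction of simulations that detect the true effect of 0.5 at P < alpha.
    hits = 0
    for _ in range(sims):
        a = [random.gauss(0.0, 1.0) for _ in range(n)]
        b = [random.gauss(0.5, 1.0) for _ in range(n)]
        hits += approx_p(a, b) < alpha
    return hits / sims

random.seed(4)
power = {n: empirical_power(n) for n in (10, 30, 100)}
print(power)  # power climbs steeply with sample size
```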

  5. Figure 5: How sample size alters estimated effect size.

    Using the indicated sample sizes, we simulated a two-sample t-test 1,000 times at each sample size using the populations in Figure 1. Right panels, estimated effect size (y axis) and the associated P value (x axis) for each simulation. Red dots show single simulations, and the contours outline increasing density of their distribution. For example, for a sample size of 64, the simulations cluster around P = 0.01 and an estimated effect size of 0.50. Each right y axis is labeled with the biggest and smallest effect sizes from simulations where P < 0.05. The true (population) effect size of 0.50 is indicated on the left y axis. Left panels, distribution of effect size for 'statistically significant' simulations (i.e., observed P < 0.05). When the sample size is 30 (power = 48%), the estimated effect size exceeds the true difference in 97% of simulations (shaded columns). For samples of 100 (power = 94%), the estimated effect size exceeds the true effect size in roughly half (55%) the simulations.
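The effect-size inflation among 'statistically significant' results can be checked with a sketch along these lines (same simplifying assumption as above: a normal approximation in place of the exact t distribution, so the percentages will only roughly match the figure's):

```python
import random
import statistics
from statistics import NormalDist

TRUE_EFFECT = 0.5

def one_comparison(n):
    # One simulated comparison: estimated effect and approximate P value
    # (normal approximation in place of the exact t distribution).
    a = [random.gauss(0.0, 1.0) for _ in range(n)]
    b = [random.gauss(TRUE_EFFECT, 1.0) for _ in range(n)]
    effect = statistics.mean(b) - statistics.mean(a)
    se = (statistics.variance(a) / n + statistics.variance(b) / n) ** 0.5
    p = 2.0 * (1.0 - NormalDist().cdf(abs(effect) / se))
    return effect, p

random.seed(5)
exaggerated = {}
for n in (30, 100):
    # Keep only 'significant' simulations, then ask how often the
    # estimated effect exceeds the true effect of 0.5.
    sig = [e for e, p in (one_comparison(n) for _ in range(1000)) if p < 0.05]
    exaggerated[n] = sum(e > TRUE_EFFECT for e in sig) / len(sig)
print(exaggerated)  # the smaller sample overstates the effect far more often
```

This is the selection effect the caption describes: when power is low, an effect must be overestimated to clear the significance threshold at all.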

  6. Figure 6: Characterizing the precision of effect size using the 95% CI of the difference between the means.

    Top, four simulated comparisons of the populations in Figure 1, using each of the indicated sample sizes (the first four pairs are those shown in Fig. 2). Bottom, mean difference between each pair of samples, with 95% CIs. The dotted line represents the true effect size. With large samples, the effect size is consistent and precisely determined and the 95% CIs do not include 0.
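A single large-sample comparison with its interval can be sketched as follows (illustrative only; the normal quantile 1.96 again stands in for the t quantile):

```python
import random
import statistics

def diff_with_ci(a, b, z=1.96):
    # Difference in means with an approximate 95% CI
    # (normal quantile used in place of the t quantile).
    d = statistics.mean(b) - statistics.mean(a)
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return d, d - z * se, d + z * se

random.seed(6)
n = 100
a = [random.gauss(0.0, 1.0) for _ in range(n)]
b = [random.gauss(0.5, 1.0) for _ in range(n)]
d, lo, hi = diff_with_ci(a, b)
print(round(d, 2), round(lo, 2), round(hi, 2))
```

Reporting the interval itself, as the figure does, conveys both the estimated effect and its precision in one step, which a bare P value cannot.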


Author information

Affiliations

  1. Lewis G. Halsey is in the Department of Life Sciences, University of Roehampton, London, UK.

  2. Douglas Curran-Everett is in the Division of Biostatistics and Bioinformatics, National Jewish Health, and the Department of Biostatistics and Informatics, Colorado School of Public Health, University of Colorado Denver, Denver, Colorado, USA.

  3. Sarah L. Vowler is at Cancer Research UK Cambridge Institute, University of Cambridge, Cambridge, UK.

  4. Gordon B. Drummond is at the University of Edinburgh, Edinburgh, UK.

Competing financial interests

The authors declare no competing financial interests.
