Points of Significance: Interpreting P values

Journal name: Nature Methods
Volume: 14
Pages: 213–214
DOI: 10.1038/nmeth.4210

A P value measures a sample's compatibility with a hypothesis, not the truth of the hypothesis.

Figures

  1. Figure 1: Using a Bayesian heuristic to interpret the P value.

    (a) Power drops at more stringent P value cutoffs α. The curve is based on a two-sample t-test with n = 10 and an effect size of 1.32. (b) The Benjamin and Berger bound calibrates the P value to probability statements about the hypothesis. At P = 0.05, the bound suggests that our alternative hypothesis is at most 2.5 times more likely than the null (black dashed line). Also shown are the conventional Bayesian cutoff, a bound of 20 (blue dashed line; P = 0.0032), and a bound of 14 (orange dashed line; P = 0.005), suggested by Johnson in ref. 2. (c) Use of the more stringent Benjamin and Berger bounds in b reduces the power of the test, because testing is now performed at α < 0.05. For a bound of 14 (orange dashed line; α = 0.005), the power is only 43%. The blue and orange dashed lines show the same bounds as in b. In all panels, black dotted lines help the reader locate values discussed in the text.

  2. Figure 2: Interpretation of the P value with heuristics based on the false discovery rate (FDR) and by examination of P values across a range of hypotheses.

    (a) The relationship between the estimated FDR (eFDR) and the proportion of tests expected to be null, π0, when testing at α = 0.05. Dashed lines indicate Altman's proposals (ref. 2) for π0. (b) The profile of P values for our biomarker example (n = 10, s_p = 1.1). The dashed line at P = 0.05 cuts the curve at the boundaries of the 95% confidence interval (0.17, 2.23), shown as an error bar. (c) P value percentiles (shown by contour lines) and 95% range (gray shading) expected from a two-sample t-test as effect size is increased. At each effect size d, data were simulated from 100,000 normally distributed samples (n = 10 per sample) with means 0 and d, respectively, and σ² = 1. The fraction of P values smaller than α is the power of the test; for example, 80% (blue contour) are smaller than 0.05 for d = 1.32 (blue dashed line). When d = 0, P values are uniformly distributed.
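
The quantities quoted in the two captions can be checked with a short Python sketch. It rests on assumptions that are not spelled out in the captions themselves: that the Benjamin and Berger bound is the expression -1/(e P ln P), which does reproduce the quoted values of about 2.5, 20 and 14; that the eFDR is α π0 / (α π0 + (1 - π0) × power) with a power of 80%; and that the critical t values for 18 degrees of freedom (2.101 for α = 0.05, 3.197 for α = 0.005) are taken from standard tables. All function names are hypothetical, and the simulation uses far fewer draws than the 100,000 mentioned in Figure 2c.

```python
import math
import random


def bayes_factor_bound(p):
    """Assumed form of the bound: -1 / (e * p * ln p), valid for 0 < p < 1/e."""
    if not 0.0 < p < 1.0 / math.e:
        raise ValueError("the bound is only defined for 0 < p < 1/e")
    return -1.0 / (math.e * p * math.log(p))


def estimated_fdr(pi0, alpha=0.05, power=0.80):
    """Assumed eFDR: expected share of significant tests that are true nulls."""
    false_positives = pi0 * alpha
    true_positives = (1.0 - pi0) * power
    return false_positives / (false_positives + true_positives)


def sample_variance(values, mean):
    """Unbiased sample variance around a precomputed mean."""
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def simulated_power(d, t_crit, n=10, trials=20000, seed=1):
    """Fraction of simulated two-sample t-tests with |t| above t_crit.

    Samples are drawn from normal distributions with means 0 and d and
    unit variance; t_crit is a tabulated two-sided critical value for
    2n - 2 degrees of freedom (hard-coded by the caller).
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y = [rng.gauss(d, 1.0) for _ in range(n)]
        mean_x = sum(x) / n
        mean_y = sum(y) / n
        # Pooled variance for equal sample sizes is the mean of the two variances.
        pooled_var = (sample_variance(x, mean_x) + sample_variance(y, mean_y)) / 2.0
        t = (mean_y - mean_x) / math.sqrt(pooled_var * 2.0 / n)
        if abs(t) > t_crit:
            rejections += 1
    return rejections / trials


if __name__ == "__main__":
    for p in (0.05, 0.005, 0.0032):
        print(f"P = {p}: bound = {bayes_factor_bound(p):.2f}")
    for pi0 in (0.5, 0.75, 0.95):  # illustrative values of pi0 only
        print(f"pi0 = {pi0}: eFDR = {estimated_fdr(pi0):.3f}")
    print("power at alpha = 0.05 :", simulated_power(1.32, 2.101))
    print("power at alpha = 0.005:", simulated_power(1.32, 3.197))
```

With these assumptions the simulated power should land near the 80% and 43% quoted in the captions, up to simulation noise. A statistics library with a noncentral t distribution would give the power exactly; the simulation is used here only to keep the sketch dependency-free.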

References

  1. Altman, N. & Krzywinski, M. Nat. Methods 14, 3–4 (2017).
  2. Wasserstein, R. & Lazar, N.A. Am. Stat. 70, 129–133 (2016).
  3. López Puga, J., Krzywinski, M. & Altman, N. Nat. Methods 12, 277–278 (2015).
  4. Jarosz, A.F. & Wiley, J.J. J. Probl. Solving 7, 2 (2014).
  5. Sellke, T. et al. Am. Stat. 55, 62–71 (2001).

Author information

Affiliations

  1. Naomi Altman is a Professor of Statistics at The Pennsylvania State University.

  2. Martin Krzywinski is a staff scientist at Canada's Michael Smith Genome Sciences Centre.
