To the Editor:

In their exchange of letters, van Helden1 and Halsey et al.2 debate the utility of the P value and of the confidence interval (CI) for interpreting experiments. Beyond the specific points raised, their exchange illustrates a clash of cultures that readers may find illuminating. Namely, there are two broad mindsets:

The craftsman: a single P value is reported as the end result of analyzing a single predefined question, as in, say, a clinical trial or a small-scale biological experiment.

The industrialist: P values summarize a screen of many hypotheses, as in gene expression analysis, genome-wide association studies and other high-throughput biology. Typically, such analyses involve iterative data exploration, and the 'result' is only an intermediate step, to be followed by further analysis. Importantly, the distribution of all the other P values provides rich contextual information for each particular P value.
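To make the industrialist's situation concrete, here is a minimal simulation; it is not from the letter, and every number in it (10,000 tests, 20% true effects, effect size, sample size) is invented for illustration. It generates a screen of two-sided z-test P values and then applies the Benjamini-Hochberg procedure, the kind of multiple-testing reasoning that is natural with P values.

```python
import numpy as np
from math import erfc, sqrt

# Hypothetical screen: 10,000 two-sided one-sample z-tests, of which the
# first 2,000 "genes" carry a true mean shift and the rest obey the null.
rng = np.random.default_rng(1)
mus = np.where(np.arange(10_000) < 2_000, 2.0, 0.0)  # true mean shifts
z = rng.normal(loc=np.sqrt(5) * mus, scale=1.0)      # z statistics, n = 5 per test
ps = np.array([erfc(abs(v) / sqrt(2)) for v in z])   # two-sided P values

# Benjamini-Hochberg: find the largest k with p_(k) <= k * alpha / m and
# reject the k smallest P values; this controls the false discovery rate.
alpha, m = 0.05, len(ps)
order = np.sort(ps)
below = np.nonzero(order <= alpha * np.arange(1, m + 1) / m)[0]
n_discoveries = int(below.max() + 1) if below.size else 0
```

The P value histogram of such a screen is a mix of a flat null background and a spike of small P values from the true effects; the Benjamini-Hochberg step uses exactly that joint distribution, which is why the "context" of all the other P values matters.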

A clash can arise between the craftsmen (exemplified by the arguments of Halsey and colleagues) and the industrialists (exemplified by van Helden). For instance, the claim made by Halsey et al.2 that “the problem with running the test many times is that this virtually never happens in practice” is true for the craftsman but blatantly wrong for large-scale testing. The figure presented by van Helden1 (including volcano plots and P value histograms) shows that he is thinking large.

How does this affect the alleged fickleness of the P value? A single P value can indeed be fickle. If the null hypothesis is true (i.e., there is no effect), the P value is uniformly distributed between 0 and 1, and if the analysis is underpowered its distribution is nearly as flat; in either case, a single P value is irreproducible. However, the distribution of many P values, industrially produced, is highly reproducible, by virtue of the law of large numbers. In fact, in large-scale testing, P values are easier to work with than CIs: multiple-testing adjustment is natural and intuitive in terms of P values but roundabout in terms of CIs. The contextual information carried by the full set of P values can be modeled using Bayesian concepts such as local false discovery rates and empirical nulls3, and moderated tests4 can avoid some of the fickleness; these approaches have been hugely successful.
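This contrast between a fickle single P value and a reproducible distribution of many P values can be checked in a short simulation; it is not from the letter, and all parameters (sample size, effect size, fraction of true effects) are invented for illustration.

```python
import numpy as np
from math import erfc, sqrt

def simulate_screen(rng, m, frac_alt=0.0, shift=2.0, n=5):
    """Simulate m two-sided one-sample z-tests with n observations each.
    A fraction frac_alt of the tests has a true mean shift; the rest
    follow the null hypothesis of zero mean."""
    mus = np.where(np.arange(m) < frac_alt * m, shift, 0.0)
    z = rng.normal(loc=np.sqrt(n) * mus, scale=1.0)  # z statistics
    return np.array([erfc(abs(v) / sqrt(2)) for v in z])  # two-sided P

# The craftsman: one null P value per replication. Each is a fresh draw
# from Uniform(0, 1), so two runs of the same experiment need not agree.
rng = np.random.default_rng(0)
p_run1 = float(simulate_screen(rng, m=1)[0])
p_run2 = float(simulate_screen(rng, m=1)[0])

# The industrialist: whole screens (10% true effects), replicated twice
# with independent data.
h1, _ = np.histogram(simulate_screen(np.random.default_rng(1), 10_000, 0.1),
                     bins=10, range=(0, 1))
h2, _ = np.histogram(simulate_screen(np.random.default_rng(2), 10_000, 0.1),
                     bins=10, range=(0, 1))
# h1 and h2 agree closely bin by bin (law of large numbers), and the
# excess of small P values over the flat null background marks the
# true effects.
```

The single null P values vary wildly from run to run, while the two histograms of 10,000 P values are nearly identical, which is the sense in which industrially produced P values are reproducible even though each individual one is not.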

Common to both sides' arguments is the observation that the P value alone is an insufficient summary of an inferential process. To usefully report the results of a statistical analysis, scientists should provide not only P values but also the underlying data and the complete analysis workflow, using a reporting tool such as Jupyter or R Markdown.