Halsey et al. reply:

van Helden argues an old point: that the mathematics underlying P and confidence intervals (CIs) are the same, and thus the two statistics convey the same information. But in our original Commentary, although we offered CIs as an alternative, we specifically mentioned other options. Our paper was not about CIs but about the fickleness of P, and having criticized P we wished to broach other, arguably better analysis methods that readers might consider. Further, although CIs could be used to make P-value-like threshold decisions, as we acknowledged in our paper, this would be an unfortunate application. In other words, van Helden misses the point that our 'suggestion to use CIs' is really a suggestion to focus data interpretation on the size of the estimated effect rather than on whether the results are 'significant' or 'not significant'. With the focus on effect size, CIs provide a way to assess the 'margin of error' around that effect estimate. This approach shifts data analysis away from significance testing, and that is our main recommendation.
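By way of illustration only, a minimal sketch of this style of reporting might look like the following, where the groups, sample sizes and effect are hypothetical and not drawn from our Commentary: interpretation rests on the estimated difference in means and the 95% CI around it, not on a significance threshold.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical data: two groups whose population means differ by 0.5 s.d.
control = rng.normal(loc=0.0, scale=1.0, size=20)
treatment = rng.normal(loc=0.5, scale=1.0, size=20)

# Effect size: the estimated difference in means
effect = treatment.mean() - control.mean()

# 95% CI for that difference (pooled-variance t interval)
n1, n2 = len(control), len(treatment)
sp2 = ((n1 - 1) * control.var(ddof=1) + (n2 - 1) * treatment.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
t_crit = stats.t.ppf(0.975, df=n1 + n2 - 2)
lo, hi = effect - t_crit * se, effect + t_crit * se

print(f"estimated effect: {effect:.2f}, 95% CI: ({lo:.2f}, {hi:.2f})")
```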

van Helden discusses other reasons for variability in the P value. However, the simulation we conducted (and which he repeated) in fact avoids all but one of the problems that beset real experiments. Because we used a theoretical 'perfect' set of data, we were studying merely the inability of insufficient samples to yield representative results; thus P is fickle even when an experiment is 'perfect'. Of course, running the test many times, as these simulations do, virtually never happens in practice. With these simulations we become 'all-seeing' about how experimental results can pan out. Real life gives only one chance at a study, and the fickleness of P indicates that whether we end up with a winning or a losing hand has much to do with luck.
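A minimal sketch of this kind of simulation (illustrative only; the sample sizes and effect size are hypothetical rather than those used in our Commentary): repeatedly drawing small samples from two idealized populations that genuinely differ shows how widely P swings from one 'experiment' to the next, even though every 'experiment' is flawless apart from its limited sample size.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# 'Perfect' populations: normal, with a genuine mean difference of 0.5 s.d.
p_values = []
for _ in range(1000):
    a = rng.normal(loc=0.0, scale=1.0, size=10)    # one small sample from population A
    b = rng.normal(loc=0.5, scale=1.0, size=10)    # one small sample from population B
    p_values.append(stats.ttest_ind(a, b).pvalue)  # two-sample t-test for this 'experiment'

p_values = np.array(p_values)
print(f"P < 0.05 in {np.mean(p_values < 0.05):.0%} of repeats")
print(f"P ranges from {p_values.min():.4f} to {p_values.max():.2f}")
```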