Stats: multiple experiments test biomedical conclusions

McGill University, Montreal, Quebec, Canada.

Valentin Amrhein and colleagues correctly point out (Nature 567, 305–307; 2019) that P values should not be used to classify scientific results as significant or non-significant (widely misinterpreted as ‘true’ or ‘not true’, respectively). However, scientists — in their dispositional revulsion towards subjectivity — routinely make a broader error.

Too many biomedical researchers still believe that single papers prove scientific points. If that were the case, the P values associated with the experiments would be important, and we could argue about what they mean and where significance thresholds should be set. Clinical scientists were disabused of this idea years ago: the results of meta-analyses routinely make a mockery of the conclusions of individual experiments.

Most high-profile preclinical papers describe multiple experiments that either depend on each other or converge on a conclusion (see J. S. Mogil and M. R. Macleod Nature 542, 409–411; 2017). The P value of each experiment is hardly relevant: the question is how many independent experiments were done in which the observed effect supports the conclusion. Even then, that conclusion would be valid only for the set of circumstances pertaining to those particular experiments.

For every conclusion, there is evidence for, evidence against, and uncertainty as to how far it can be generalized. Results are always provisional, P values or no.

Nature 569, 192 (2019)

doi: 10.1038/d41586-019-01454-6