CORRESPONDENCE

Stats: multiple experiments test biomedical conclusions

Valentin Amrhein and colleagues correctly point out (Nature 567, 305–307; 2019) that P values should not be used to classify scientific results as significant or non-significant (widely misinterpreted as ‘true’ or ‘not true’, respectively). However, scientists — in their dispositional revulsion towards subjectivity — routinely make a broader error.

Too many biomedical researchers still believe that single papers prove scientific points. If that were the case, the P values associated with the experiments would be important, and we could argue about what they mean and where significance thresholds should be set. Clinical scientists were disabused of this idea years ago: the results of meta-analyses routinely make a mockery of the conclusions of individual experiments.

Most high-profile preclinical papers describe multiple experiments that either depend on each other or converge on a conclusion (see J. S. Mogil and M. R. Macleod Nature 542, 409–411; 2017). The P value of each experiment is hardly relevant: the question is how many independent experiments were done in which the observed effect supports the conclusion. Even then, that conclusion would be valid only for the set of circumstances pertaining to those particular experiments.

For every conclusion, there is evidence for, evidence against, and uncertainty as to how far it can be generalized. Results are always provisional, P values or no.

Nature 569, 192 (2019)
