Statistics: P values are just the tip of the iceberg

Leek, Jeffrey T.; Peng, Roger D.

doi:10.1038/520612a

Comment
Published: 28 April 2015

Nature volume 520, page 612 (2015)

Ridding science of shoddy statistics will require scrutiny of every step, not merely the last one, say Jeffrey T. Leek and Roger D. Peng.

There is no statistic more maligned than the P value. Hundreds of papers and blog posts have been written about what some statisticians deride as 'null hypothesis significance testing' (NHST; see, for example, go.nature.com/pfvgqe). NHST judges the results of a data analysis to be important or not according to whether a summary statistic (such as a P value) has crossed a threshold. Given the discourse, it is no surprise that some hailed as a victory the banning of NHST methods (and all of statistical inference) in the journal Basic and Applied Social Psychology in February¹.

Such a ban will in fact have scant effect on the quality of published science. There are many stages to the design and analysis of a successful study (see 'Data pipeline'). The last of these steps is the calculation of an inferential statistic such as a P value, and the application of a 'decision rule' to it (for example, P < 0.05). In practice, decisions that are made earlier in data analysis have a much greater impact on results — from experimental design to batch effects, lack of adjustment for confounding factors, or simple measurement error. Arbitrary levels of statistical significance can be achieved by changing the ways in which data are cleaned, summarized or modelled².
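The effect of such upstream choices can be illustrated with a small simulation. The sketch below is not the authors' analysis or the procedure of ref. 2; it is a minimal illustration under stated assumptions: two groups of 50 subjects with no true difference, two outcomes correlated at 0.5, an optional rule that drops points more than 2 standard deviations from the mean, and a large-sample normal approximation for the P value. All of these settings and function names are hypothetical. It compares an analyst who fixes one analysis in advance with one who tries six variants and reports the smallest P value.

```python
import random
from statistics import NormalDist, mean, stdev


def two_sided_p(x, y):
    """Two-sided P value for a difference in means (normal approximation)."""
    se = (stdev(x) ** 2 / len(x) + stdev(y) ** 2 / len(y)) ** 0.5
    z = (mean(x) - mean(y)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def trim(values, k=2.0):
    """One possible cleaning rule: drop points more than k SDs from the mean."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) <= k * s]


def simulate_group(rng, n, rho=0.5):
    """Two correlated outcomes per subject, both pure noise (no group effect)."""
    a, b = [], []
    for _ in range(n):
        shared = rng.gauss(0, 1)
        a.append(rho ** 0.5 * shared + (1 - rho) ** 0.5 * rng.gauss(0, 1))
        b.append(rho ** 0.5 * shared + (1 - rho) ** 0.5 * rng.gauss(0, 1))
    return a, b


def false_positive_rates(n_sims=1000, n=50, alpha=0.05, seed=1):
    """Return (fixed-analysis rate, best-of-six-analyses rate) under the null."""
    rng = random.Random(seed)
    fixed_hits = 0
    flexible_hits = 0
    for _ in range(n_sims):
        a1, b1 = simulate_group(rng, n)
        a2, b2 = simulate_group(rng, n)
        avg1 = [(u + v) / 2 for u, v in zip(a1, b1)]
        avg2 = [(u + v) / 2 for u, v in zip(a2, b2)]
        p_values = []
        for g1, g2 in [(a1, a2), (b1, b2), (avg1, avg2)]:
            p_values.append(two_sided_p(g1, g2))              # raw data
            p_values.append(two_sided_p(trim(g1), trim(g2)))  # after cleaning
        # Pre-specified analysis: outcome A, no trimming.
        fixed_hits += int(p_values[0] < alpha)
        # Flexible analysis: report whichever variant looks best.
        flexible_hits += int(min(p_values) < alpha)
    return fixed_hits / n_sims, flexible_hits / n_sims


fixed_rate, flexible_rate = false_positive_rates()
print(f"pre-specified analysis: {fixed_rate:.3f}")
print(f"best of six analyses:   {flexible_rate:.3f}")
```

Because the flexible analyst's six variants include the pre-specified one, the flexible false-positive rate can never be lower than the fixed one; how much higher it is depends on the assumed settings.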

P values are an easy target: being widely used, they are widely abused. But, in practice, deregulating statistical significance opens the door to even more ways to game statistics — intentionally or unintentionally — to get a result. Replacing P values with Bayes factors or another statistic is ultimately about choosing a different trade-off of true positives and false positives. Arguing about the P value is like focusing on a single misspelling, rather than on the faulty logic of a sentence.
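The trade-off can be made concrete with a short calculation. The sketch below assumes a one-sided z-test whose statistic is standard normal under the null hypothesis and shifted by a standardized effect of 2.5 under the alternative; the effect size, the list of thresholds and the function name `power_at` are illustrative assumptions, not values from this article. It shows only that tightening a threshold lowers the false-positive rate at the cost of the true-positive rate; any other decision rule on a summary statistic picks its own point on a curve of this kind.

```python
from statistics import NormalDist


def power_at(alpha, effect=2.5):
    """True-positive rate of a one-sided z-test run at false-positive rate alpha.

    `effect` is the assumed shift of the test statistic under the alternative,
    in standard-deviation units.
    """
    std = NormalDist()
    cutoff = std.inv_cdf(1 - alpha)      # reject when z exceeds this value
    return 1 - std.cdf(cutoff - effect)  # P(z > cutoff) when z ~ N(effect, 1)


thresholds = [0.10, 0.05, 0.01, 0.005, 0.001]
rates = [(alpha, power_at(alpha)) for alpha in thresholds]

for alpha, power in rates:
    print(f"false-positive rate {alpha:<6} true-positive rate {power:.3f}")
```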

Better education is a start. Just as anyone who does DNA sequencing or remote-sensing has to be trained to use a machine, so too anyone who analyses data must be trained in the relevant software and concepts. Even investigators who supervise data analysis should be required by their funding agencies and institutions to complete training in understanding the outputs and potential problems with an analysis.

There are online courses specifically designed to address this crisis. For example, the Data Science Specialization, offered by Johns Hopkins University in Baltimore, Maryland, and Data Carpentry, can easily be integrated into training and research. It is increasingly possible to learn to use the computing tools relevant to specific disciplines — training in Bioconductor, Galaxy and Python is included in Johns Hopkins' Genomic Data Science Specialization, for instance.

But education is not enough. Data analysis is taught through an apprenticeship model, and different disciplines develop their own analysis subcultures. Decisions are based on cultural conventions in specific communities rather than on empirical evidence. For example, economists call data measured over time 'panel data', to which they frequently apply mixed-effects models. Biomedical scientists refer to the same type of data structure as 'longitudinal data', and often go at it with generalized estimating equations.

Statistical research largely focuses on mathematical statistics, to the exclusion of the behaviour and processes involved in data analysis. To solve this deeper problem, we must study how people perform data analysis in the real world. What sets them up for success, and what for failure? Controlled experiments have been done in visualization³ and risk interpretation⁴ to evaluate how humans perceive and interact with data and statistics. More recently, we and others have been studying the entire analysis pipeline. We found, for example, that recently trained data analysts do not know how to infer P values from plots of data⁵, but they can learn to do so with practice.

The ultimate goal is evidence-based data analysis⁶. This is analogous to evidence-based medicine, in which physicians are encouraged to use only treatments for which efficacy has been proved in controlled trials. Statisticians and the people they teach and collaborate with need to stop arguing about P values, and prevent the rest of the iceberg from sinking science.

References

1. Trafimow, D. & Marks, M. Basic Appl. Soc. Psych. 37, 1–2 (2015).
2. Simmons, J. P., Nelson, L. D. & Simonsohn, U. Psychol. Sci. 22, 1359–1366 (2011).
3. Cleveland, W. S. & McGill, R. Science 229, 828–833 (1985).
4. Kahneman, D. & Tversky, A. Econometrica 47, 263–291 (1979).
5. Fisher, A., Anderson, G. B., Peng, R. & Leek, J. PeerJ 2, e589 (2014).
6. Leek, J. T. & Peng, R. D. Proc. Natl Acad. Sci. USA 112, 1645–1646 (2015).

Author information

Authors and Affiliations

Jeffrey T. Leek and Roger D. Peng are associate professors of biostatistics at the Johns Hopkins Bloomberg School of Public Health in Baltimore, Maryland, USA.

Corresponding author

Correspondence to Jeffrey T. Leek.

Cite this article

Leek, J., Peng, R. Statistics: P values are just the tip of the iceberg. Nature 520, 612 (2015). https://doi.org/10.1038/520612a

Issue Date: 30 April 2015
