Most psychological measures are used only once or twice, and even nominally identical measures often differ between studies. This proliferation and variability threaten the credibility of research. The Standardisation Of BEhavior Research (SOBER) guidelines aim to ensure that psychological measures are standardised and, unlike toothbrushes, reused by others.
Psychological constructs and measures suffer from the toothbrush problem1: no self-respecting psychologist wants to use anyone else’s. The result is proliferation: many measures are used only once or twice (Fig. 1), and researchers show no tendency to converge on common measures over time (Fig. 2). Proliferation happens because researchers promote their own brands, because discovering reusable measures in the large, fragmented academic literature is difficult, and sometimes simply for novelty’s sake. At first glance, measurement proliferation may seem negligible, or even beneficial – after all, who would oppose studying the same phenomenon through multiple methodological approaches? Here, we argue that proliferation is in fact a serious barrier to cumulative science.
A jingle-jangle of labels
Some measures quantify different things but share similar labels, or even identical ones: in APA PsycTests, no fewer than 19 different tests go by “theory of planned behavior questionnaire”, 15 by “job satisfaction scale”, and 11 by “self-efficacy scale”. Other measures quantify the same thing as existing measures but under a different label. Known as the jingle and jangle fallacies, these are common and well-documented threats to the replicability and validity of psychological research, for example in studies on emotion2. Both involve a nominal fallacy: the assumption that a measure’s name tells you about its contents or what it measures3.
Undisclosed flexibility
Even when authors report using the same measure of the same construct, all is not yet well, because disclosed and undisclosed measurement flexibility – changes to a measure with known or unknown psychometric consequences – is common4. Researchers drop, add, and alter items in self-report scales, aggregate total scores in various ways in laboratory tasks, or vary stimuli and trial durations, all while referring not only to the same construct but to the same nominal instrument5. Even when all decisions are disclosed, only a methodological literature review will reveal that many studies used, for instance, unique aggregation algorithms, scoring strategies, or items, often with unknown psychometric consequences.
Generalising across samples
Current test norms are necessary for valid comparisons of individual test scores to population values6. Yet even widely used measures have typically never been normed in the population in which they are being used, or the available test norms are badly outdated. Without current norms, users of a measure cannot judge whether their sample selection procedure introduced bias, which makes it harder to judge generalisability. Supposedly ‘standardised’ effect sizes, such as correlations, are biased when all that is available for standardisation is the restricted within-sample standard deviation7. Such effect sizes cannot be directly aggregated across samples without complex corrections – but nonetheless routinely are in meta-analyses.
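To make the problem concrete, the sketch below applies Thorndike’s Case II correction for direct range restriction, one of the standard artifact corrections discussed by Wiernik and Dahlke7. It is a minimal illustration of our own (the numbers are invented), not part of the SOBER guidelines – and note that it only works when a population standard deviation, that is, a norm, is available at all.

```python
import numpy as np

def correct_range_restriction(r_obs, sd_sample, sd_pop):
    """Thorndike Case II correction for direct range restriction.

    r_obs:     correlation observed in the restricted sample
    sd_sample: standard deviation of the predictor in the sample
    sd_pop:    standard deviation in the reference population
               (exactly the kind of norm that is usually missing)
    """
    u = sd_pop / sd_sample  # degree of restriction
    return r_obs * u / np.sqrt(1 + r_obs**2 * (u**2 - 1))

# A sample whose SD is only 60% of the population SD yields r = .30;
# the corresponding population-level correlation is considerably larger.
print(round(correct_range_restriction(0.30, sd_sample=6.0, sd_pop=10.0), 3))
# -> 0.464
```

Without a normed value for sd_pop, the ‘standardised’ within-sample correlation of .30 would enter a meta-analysis as-is, understating the population value by roughly a third.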
Literatures that seem coherent and rigorous to the casual or even the experienced reader might in fact be anything but. For a reasonable synthesis of the evidence, meta-analysts would have to correct for the differences in sample selection, variability, reliability, and any other measurement-driven sources of heterogeneity. However, doing so properly is often difficult because the required information is missing7 and bias correction techniques rely on often untestable assumptions.
Hence, (a) the lack of strong empirical or procedural norms in measurement, (b) the lack of transparency in reporting, and (c) the lack of common referents (i.e., test norms) together pose an enormous threat to meaningful evidence cumulation and research synthesis. For instance, to boost the reliability coefficient of an established scale in a primary study, researchers frequently drop what they argue to be ‘poorly performing’ items8. However, this approach is itself unreliable and produces inconsistent recommendations about which item to drop9. Without an out-of-sample norm for the modified measure, or a replication showing that dropping the chosen item(s) consistently improves measurement, this practice does not improve the current or future use of the measure – especially when such item dropping is not fully reported, as is often the case8: a measure cannot be improved when changes to it are not communicated.
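A small simulation illustrates why in-sample item dropping is unreliable. In the sketch below – our own illustration, not an analysis from the cited studies – every item of a simulated unidimensional scale is equally good by construction, so any ‘alpha if item deleted’ recommendation is pure sampling noise, and across repeated samples the nominated item scatters over the whole scale.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(2023)

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_subjects, k_items) score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

def simulate_scale(n, k=8, loading=0.5):
    """Unidimensional scale in which every item is equally good."""
    factor = rng.standard_normal((n, 1))
    noise = rng.standard_normal((n, k))
    return loading * factor + np.sqrt(1 - loading**2) * noise

# In each of 1000 samples, which item does 'alpha if item deleted'
# nominate as the one to drop?
picks = []
for _ in range(1000):
    data = simulate_scale(n=100)
    alpha_if_deleted = [cronbach_alpha(np.delete(data, j, axis=1))
                        for j in range(data.shape[1])]
    picks.append(int(np.argmax(alpha_if_deleted)))

# Recommendations scatter roughly uniformly across all eight
# (identical) items: the 'poorly performing' item is sampling noise.
print(sorted(Counter(picks).items()))
```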
Consequently, when measurement flexibility is present, studies with measures using the same label may not be meaningfully comparable, either directly with each other or in a larger research synthesis. Further, new validity evidence for an existing measure can only be applied quite narrowly, or with great uncertainty, when it is unclear which studies match the validated protocol. And even the most peculiar decisions in a study can remain unnoticed, and their implications unknown, if no meaningful comparison against a proven standard can be made. Not being able to detect that study results are the outcome of a fishing expedition can result in a seemingly homogeneous literature that is actually the product of a trawling conglomerate.
The SOBER guidelines
Psychology should be serious about standardising its measures – and it currently is not. But who should act on this call to raise the bar on measurement and rein in ad-hockery? Many stakeholders are able to shape scientists’ behaviours through meaningful policies, but we believe journals in general – and psychology journals promoting robust science, including Communications Psychology, in particular – should implement policies to raise the quality of psychological science. Here, we propose the Standardisation Of BEhavior Research (SOBER) guidelines, which specifically address issues of flexibility and norming in measurement (see Table 1).
Moving forward
Across the psychological landscape, we call on research communities to (1) recognise measurement flexibility as a serious threat to scientific credibility, on a par with the widely acknowledged problem of p-hacking, and (2) set and use validated standards for measurement, together with standard reporting guidelines, both maintained and updated as new evidence accrues.
To this end, we envision an open repository of measurement protocols – one that fills the gap left by academic publishers after the decline of paper-and-pencil measures – with implementations for widely used study tools, test norms, standard scoring rules, and machine-readable metadata. Such a repository would facilitate the discovery of measures, the scrutiny of design decisions, and meta-science through systematic assessment of research practices. It would also permit large language models to independently assess the semantic overlap and reliability, and possibly even estimate the discriminant and convergent validity, of newly developed measures against those already stored in the repository.
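What machine-readable protocol metadata could look like is sketched below. Every field name is hypothetical – a thought experiment about the kind of record such a repository might store, not a proposed standard.

```python
# Hypothetical record for one measurement protocol; all field names
# and values are illustrative, not a proposed schema.
protocol = {
    "measure": "Example Anxiety Scale",          # fictitious instrument
    "protocol_version": "2.1.0",                 # exact, citable version
    "construct": "trait anxiety",
    "items": [                                   # verbatim item wording
        "I worry about things that may never happen.",
        # ... remaining items ...
    ],
    "response_scale": {"points": 5, "anchors": ["never", "always"]},
    "scoring": {"reverse_scored_items": [3], "aggregation": "mean"},
    "norms": [                                   # common referents
        {"population": "German adults", "year": 2022,
         "mean": 2.41, "sd": 0.68, "n": 1843},
    ],
    "provenance": {"derived_from": None, "license": "CC-BY-4.0"},
}
```

Versioned records of this kind would make deviations from a protocol detectable by simple comparison, rather than only through a methodological literature review.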
Scrutiny of the details of previous work’s measures is necessary both to inform how we should interpret existing findings and to increase measures’ future reuse potential. Transparency about the fine-grained details of our measures allows others to reuse them with fidelity, and allows the fidelity of measures to be checked across studies. These benefits of transparency have yet to be tapped by our field. If we want to build a cumulative evidence base in psychology, we need to standardise our measures and protocols. Psychologists need to stop remixing and recycling, and start reusing (measures, not toothbrushes).
References
1. Mischel, W. The toothbrush problem. APS Observer 21 (2008). https://www.psychologicalscience.org/observer/the-toothbrush-problem (accessed 22 May 2023).
2. Weidman, A. C., Steckler, C. M. & Tracy, J. L. The jingle and jangle of emotion assessment: imprecise measurement, casual scale usage, and conceptual fuzziness in emotion research. Emotion 17, 267–295 (2017).
3. Lilienfeld, S. O. & Strother, A. N. Psychological measurement and the replication crisis: four sacred cows. Can. Psychol. 61, 281–288 (2020).
4. John, L. K., Loewenstein, G. & Prelec, D. Measuring the prevalence of questionable research practices with incentives for truth telling. Psychol. Sci. 23, 524–532 (2012).
5. Elson, M. Examining psychological science through systematic meta-method analysis: a call for research. Adv. Methods Pract. Psychol. Sci. 2, 350–363 (2019).
6. Crocker, L. M. & Algina, J. Introduction to Classical and Modern Test Theory (Cengage Learning, 2008).
7. Wiernik, B. M. & Dahlke, J. A. Obtaining unbiased results in meta-analysis: the importance of correcting for statistical artifacts. Adv. Methods Pract. Psychol. Sci. 3, 94–123 (2020).
8. Cortina, J. M. et al. From alpha to omega and beyond! A look at the past, present, and (possible) future of psychometric soundness in the Journal of Applied Psychology. J. Appl. Psychol. 105, 1351–1381 (2020).
9. Kopalle, P. K. & Lehmann, D. R. Alpha inflation? The impact of eliminating scale items on Cronbach’s alpha. Organ. Behav. Hum. Decis. Process. 70, 189–197 (1997).
10. Wilcox, A. R. Indices of qualitative variation and political measurement. West. Polit. Q. 26, 325–343 (1973).
Acknowledgements
This research was supported by the META-REP Priority Program of the German Research Foundation (#464488178). The funders had no role in the decision to publish or preparation of the manuscript. The authors thank the American Psychological Association for their support and for providing access to the APA PsycTests database.
Author information
Contributions
Conceptualisation: M.E., I.H., T.A., R.C.A.; visualisation: R.C.A.; writing—original draft: M.E.; writing—review and editing: M.E., I.H., T.A., R.C.A.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review information
Communications Psychology thanks the anonymous reviewers for their contribution to the peer review of this work. Primary Handling Editor: Marike Schiffer. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.