Proliferation of measures contributes to advancing psychological science

Proliferation and variability of psychological measures are part of the natural workings of the scientific process. They contribute to theory development, transparency, replicability, and validity, and restricting proliferation might hinder progress in the psychological sciences.



REVIEWERS' EXPERTISE:
Reviewer #1: measurement, reproducibility
Reviewer #2: measurement, reproducibility

REVIEWERS' COMMENTS:

Reviewer #1 (Remarks to the Author):

Mixed feelings about this potential contribution: although comment on original work is, of course, always welcome, the original work presented large-scale data supporting the views, suggestions, and proposals of the original authors, whilst this submitted commentary does not present any further data. Hence, it constitutes an opinion piece throughout, and I found it difficult to agree with the authors' various claims, starting from the statement in the opening paragraph that the suggested guidelines are "prescriptive" and would be "artificially restricting". Again, line 77: "viewed by us as overly restrictive". My understanding was and is that these guidelines currently constitute a grassroots and bottom-up voluntary proposal, i.e., you and me and others CAN use them as guidance - so why "overly restrictive"?

Lines 39-40: the example presented here, i.e., the RSES (Rosenberg Self-Esteem Scale), in my view actually is a counter-example, because self-esteem is a quite broad concept, but the RSES has been very widely used and, in fact, is one of the most widely used (and thus most widely cited) self-report instruments across psychological science. Regarding the statement that the RSES item pool should be rephrased "when used in a sample of school children, and if used in a sample of immigrants": I wondered whether any such contextualizing, very specific adaptations of the RSES are out there (the authors do not cite any) - this claim might perhaps be more a form of social constructivism to which psychometricians would not agree. Also, some strawman argumentation seems to be put forward: RE "a single measure" of psychopathy - this is not the case, and I think it has not been argued for.
I also have to disagree with the claim that "proliferation and variability of psychological measures are a direct result of the scientific process, and this process leads not to fewer but to more (and higher-quality) measures". Why should this exactly be the case? One would surmise that progress eventually leads to the weeding out of the less-than-optimal measures. What is absent in this argumentation is the key concept of incremental validity in psychological assessment and scale development and validation: alternative, novel measures must show superiority relative to existing measures. This key concept of incremental validity, as one important, informative facet of the validation framework, is glaringly absent in this text.
RE "fragmentation can be interpreted as the continuous development of psychological science": I have to disagree here as well.Same argumentation as above: think, for instance, of the phenomena of consilience, universality, generalizability across and within science fields.From an epistemic point of view, fragmentation of concepts and knowledge can hardly be a sign of scientific progress.This debatable argument reappears throughout the text, e.g., line 87: "science progresses … through diversity".
Another claim (once again, presented without any refs) is that "even minor adaptations in a psychological measure lead to (substantially) different results". Is this really the case, and would this be generalizable? Think of parallel test forms; they usually work quite well. Line 108, "(more) openly sharing information about any and all measures": this was already included in the SOBER guidelines.
The authors end on the note that "a non-prescriptive stance" should be taken, but fail to elaborate on the forms and means through which this could precisely be achieved (i.e., no alternative proposal is outlined).
Minor comment: the opening paragraph is entitled "Preface". This is a little bit unusual.
Reviewer #2 (Remarks to the Author):

Thank you for the invitation to review 'The proliferation of measures in psychology: Between frustration and need' for Communications Psychology.
The commentary is fine as is, but could be strengthened further. I provide comments below.
Note that we have written about similar measurement issues, sometimes based on the exact arguments and rationale the authors here provide. It feels like the authors could strengthen the work by referring to some of this. On the one hand, I understand this is unusual and often a red flag, so I leave it to the authors and editor to decide whether some of these references would be helpful. On the other, I assume I was invited to review this paper because we have done a lot of work on the topic, and it would be remiss not to point out potential work on the topic.
Broadly, the authors respond to a piece with the argument that there are 2 justifiable reasons for measure proliferation.
(1) Summarized: "Different measures for construct X may depend on the circumstance/situation, i.e. measures are context dependent (e.g. language, population)." I think you could rephrase the header; most readers will have a hard time understanding that. Some of the language seems unnecessary ("in virtually all cases idiosyncratic"); there are of course cross-cultural or cross-setting studies less idiosyncratic than others. We have written about context in some detail in our piece arguing against mandating a particular depression instrument by NIH/WELLCOME that seems relevant in this context: https://pubmed.ncbi.nlm.nih.gov/32986859/. Given that the authors of the original piece the authors comment on will agree that different language versions should exist, this does not count as proliferation, and I would look for stronger examples the authors may disagree with.
(2) Summarized: "Different measures for construct X may tap into different constructs X." We talk about this in detail here: https://www.nature.com/articles/s44159-022-00050-2, discussing proliferation of depression measures and how they were developed for different contexts and may tap into different constructs. An issue I see with the line of reasoning the authors provide is that psych scales rarely start with a clear construct definition and then properly measure that - this is also the case when interrogating the scales in our work. They were developed for a purpose (e.g., prevalence rate estimation vs tracking treatment progress), but clear (in the best case formal) theories of constructs are absent, making it difficult to evaluate whether measures measure the purported constructs, because those weren't clearly defined. In that sense, one could disagree with the authors that proliferation "leads to more (higher quality)" scales. This is only the case if constructs are clearly defined, which isn't the case. From our overview on the topic (https://journals.sagepub.com/doi/10.1177/2515245920952393): "Barry, Chaney, Piazza-Gardner, and Chavarria (2014) reported that between 40% and 93% of measures used across seven journals in educational behavior lacked validity evidence, and Weidman, Steckler, and Tracy (2017) reported that among the 356 measurement instances coded in their review of emotion research, 69% included no reference to prior research or a systematic development process. In their review of the relation between technology use and well-being, Orben and Przybylski (2019) reported that researchers pick and choose within and between questionnaires, 'making the pre-specified constructs more of an accessory for publication than a guide for analyses' (p. 181)."
The authors then continue to look into consequences of less proliferation.
-"Unjustified variability in psychological measures should be discouraged -but it is difficult to ascertain when variability is indeed meaningless".I think there is decent literature on this-would anybody consider it reasonable to have 280 depression scales?https://www.tandfonline.com/doi/abs/10.1207/s15366359mea0403_1-"At the same time, programmatic stifling of such 74 variability is likely to be detrimental if taken to the extreme and blindly imposed or enforced across situations".Agreed, and there is literature on this, including a number of arguments: https://pubmed.ncbi.nlm.nih.gov/32986859/Best of luck with the manuscript!To improve the transparency and accountability of the peer review process, I sign my reviews Eiko Fried, Leiden University Dear Editor, We would like to thank you and the reviewers very much for taking the time to carefully read and comment on our submission to Communications Psychology with reference COMMSPSYCHOL-23-0377A.
We appreciate the valuable feedback and the opportunity to resubmit the manuscript with modifications. We respond below to the reviewers' comments and also highlight the changes that we implemented in the text. We have gladly implemented the suggestions of the reviewers and hope that the manuscript is now suitable for publication in Communications Psychology.
Sincerely yours,

***

Reviewer #1 (Remarks to the Author): Mixed feelings about this potential contribution: although comment on original work is, of course, always welcome, the original work presented large-scale data supporting the views, suggestions, and proposals of the original authors, whilst this submitted commentary does not present any further data. Hence, it constitutes an opinion piece throughout, and I found it difficult to agree with the authors' various claims, starting from the statement in the opening paragraph that the suggested guidelines are "prescriptive" and would be "artificially restricting".

>> Authors' response: Thank you for your comments. We have used them to refine the wording and, we believe, the fundamental tone of the paper. Regarding the opening paragraph that you referred to: we have changed the invoked stance in the preface, and have deleted both the word "artificially" and the suggestion that some guidelines may be "prescriptive".

Again, line 77: "viewed by us as overly restrictive". My understanding was and is that these guidelines currently constitute a grassroots and bottom-up voluntary proposal, i.e., you and me and others CAN use them as guidance - so why "overly restrictive"?

>> Authors' response: Thank you very much for this important comment. We agree that, at this point in time, the guidelines that we refer to constitute a bottom-up voluntary proposal and have not yet seen broad implementation. In our commentary, we do not argue that their current impact has created a factually overly restrictive situation when it comes to developing new psychological measures. Instead, we highlight that the purpose of such guidelines is to seek comprehensive implementation across the field and that, through this, these guidelines have the potential to alter the landscape of the field. Against this background, we argue that the content of the guidelines - even though currently at the grassroots level - can be perceived as overly restrictive and requires careful reconsideration before being put into practice. Our core argument is that the consequences of such guidelines might not have been sufficiently appreciated, in particular with regard to unintended side effects, and that a careful process of weighing the pros and cons of measurement proliferation needs to be reflected in any new guideline. In our view, such a process was not sufficiently implemented in recent guidelines, which motivated us to write this commentary with a broad allusion to the question of measurement proliferation in psychology. In the revised version, we have adapted the sentence on line 77 that you highlight in your comment. We have deleted the wording "viewed by us as overly restrictive". We have also introduced another, stronger example, and have rephrased the reference to SOBER, i.e.: "The recent SOBER guidelines, which also attempt to inhibit such variability, are well-intentioned, but may have little impact if not enforced by funders and journals".
Lines 39-40: the example presented here, i.e., the RSES (Rosenberg Self-Esteem Scale), in my view actually is a counter-example, because self-esteem is a quite broad concept, but the RSES has been very widely used and, in fact, is one of the most widely used (and thus most widely cited) self-report instruments across psychological science. Regarding the statement that the RSES item pool should be rephrased "when used in a sample of school children, and if used in a sample of immigrants": I wondered whether any such contextualizing, very specific adaptations of the RSES are out there (the authors do not cite any) - this claim might perhaps be more a form of social constructivism to which psychometricians would not agree.

>> Authors' response: Thank you very much for flagging a relevant inconsistency in the example of the RSES. After careful consideration, and with the length restrictions for commentaries in mind (the revised version is about 300 words longer than commentaries usually are), we have decided to delete the example entirely, as it is not essential for our argument and we are confident that the argument will be equally accessible to readers even without it.

Also, some strawman argumentation seems to be put forward: RE "a single measure" of psychopathy - this is not the case, and I think it has not been argued for.
>> Authors' response: Thank you for your comment.We did not intend this to be a strawman argumentationour line of argumentation was to suggest that there is a fine line after which restriction of measures becomes ridiculous.But we see now how this could be misconstrued, and have now changed the text to "… and several measures are needed in order to map onto all theoretical perspectives".
I also have to disagree with the claim that "proliferation and variability of psychological measures are a direct result of the scientific process, and this process leads not to fewer but to more (and higher-quality) measures". Why should this exactly be the case? One would surmise that progress eventually leads to the weeding out of the less-than-optimal measures. What is absent in this argumentation is the key concept of incremental validity in psychological assessment and scale development and validation: alternative, novel measures must show superiority relative to existing measures. This key concept of incremental validity, as one important, informative facet of the validation framework, is glaringly absent in this text.

>> Authors' response: Thank you for your comment and for flagging this concern. We have redesigned the corresponding paragraph to explain our reasoning with more care. In revising this paragraph we have also taken your next note into account (see immediately below), where we describe in more detail how we proceeded with the revision.
RE "fragmentation can be interpreted as the continuous development of psychological science": I have to disagree here as well.Same argumentation as above: think, for instance, of the phenomena of consilience, universality, generalizability across and within science fields.From an epistemic point of view, fragmentation of concepts and knowledge can hardly be a sign of scientific progress.This debatable argument reappears throughout the text, e.g., line 87: "science progresses … through diversity".
>> Authors' response: Thank you for flagging your concern with regard to fragmentation in psychological science and the argument that we put forward.We believe that there might be a misunderstanding here.In our argument, we do not advocate for fragmentation of theoretical concepts in science (and we agree with the reviewer that we have too much of this across psychology).Our commentary is related to the proliferation of psychological _measures_ within theoretical concepts and we argue that, as theoretical ideas are refined and extended, new populations become the focus of investigation, or the number of use cases increases, proliferation of measures or fragmentation of measures is part of this process as all these different contextual factors might lead to (slightly) modified measures.Most of them will be either quickly dismissed or remain reserved for specific purposes, but some of them will experience wide usethis is the pattern that is to be expected and that is also highly consistent with the pattern reported in the recently published commentary to which we reference: Most measures are not used often and a minority experience wide use.Having multiple measures available actually provides an opportunity to test whether fragmented measures (as these slightly modified versions could be called) converge on an empirical and conceptual level and this serves as one of the strongest possible tests for the robustness of the underlying theoretical concept.
In the revised version of the paper, we have tried to clarify our argument and to avoid any potential for misunderstanding. It now reads: "Thus, we argue that proliferation and variability of psychological measures are a direct result of the scientific process. As theoretical ideas are refined, new populations come into focus, or the number of use cases increases, new measures are developed; most of them will be either quickly dismissed or remain reserved for specific purposes, but some of them will experience wide use. This process leads not to fewer but to more (and eventually higher-quality) measures. Meaningful latent constructs converge and emerge through such proliferation of measures. The entire replicability and transparency movement relies heavily on this idea: relations between theoretical constructs are generalizable only as far as they stand the test of diversity and are confirmed in different samples, with variable measures, and across different contexts. Instead of being considered negative, this observed fragmentation of measures within theoretical constructs can be interpreted as the continuous development of psychological science, if certain quality-ensuring steps are adhered to."

Another claim (once again, presented without any refs) is that "even minor adaptations in a psychological measure lead to (substantially) different results". Is this really the case, and would this be generalizable? Think of parallel test forms; they usually work quite well.

>> Authors' response: Thank you again for challenging our ideas. The point we intended to make here was that minor adaptations in a psychological measure should not lead to substantially different results on the conceptual level and, if they do, one might question the robustness of the theoretical construct. Along this line of thinking, proliferation of measures can help to test conceptual robustness and generalizability. We did not suggest that minor adaptations lead to different results, but intended the text as a "what if" case: if even minor adaptations were to lead to different results, then the initial findings would really be questionable. We have revised our manuscript to add clarity. The revised text now reads: "We believe that minor adaptations in psychological measures should not lead to radically different results: in those cases where even minor adaptations in a psychological measure lead to (substantially) different results, it is reasonable to question the robustness of the initial findings on the latent level."

Line 108, "(more) openly sharing information about any and all measures": this was already included in the SOBER guidelines.

>> Authors' response: We do not contradict the meaningfulness of this proposal at all. We only think that it is too soft and that more decisive action could be taken, building on technology. We have added this sentence: "However, we believe that good intentions regarding data sharing may not be sufficient for a systemic change, and we suggest that the change should be actively driven - for example, by elaborating in more detail the technical standards for such (possibly automated) exchange of information."

The authors end on the note that "a non-prescriptive stance" should be taken, but fail to elaborate on the forms and means through which this could precisely be achieved (i.e., no alternative proposal is outlined).

>> Authors' response: Thank you for challenging us to rethink our Conclusion section. We have reorganized the three paragraphs in our Conclusions.
We have both shifted some of the text around and added new text. The flow of the Conclusion section is now as follows: (1) We formulate a summary of our ideas and point out that there may be benign and justifiable reasons for variation in measures. (2) We point out that data sharing can exploit this tendency towards proliferation in a positive way, to the benefit of scientific progress - but that good intentions are not enough. (3) We advance the proposal of a repository of measurement protocols, but suggest that technical standards for automated exchange still need to be developed.
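To make point (3) more concrete: purely as an illustration of what such technical standards for automated exchange might involve (the repository, the record format, and every field name below are hypothetical; no such standard currently exists), a machine-readable record for one measurement protocol could look like this minimal Python sketch:

```python
# Hypothetical sketch only: illustrates what a machine-readable record in a
# repository of measurement protocols might contain. All field names are
# invented for illustration; no such standard currently exists.
from dataclasses import dataclass, field, asdict
import json


@dataclass
class MeasureRecord:
    """One entry in a (hypothetical) repository of measurement protocols."""
    name: str                        # human-readable name of the measure
    construct: str                   # latent construct the measure targets
    version: str                     # distinguishes adaptations of the same measure
    language: str                    # language of administration
    target_population: str           # population the adaptation was validated in
    items: list[str] = field(default_factory=list)
    validity_evidence: list[str] = field(default_factory=list)  # DOIs or URLs
    derived_from: str | None = None  # parent measure, if this is an adaptation


record = MeasureRecord(
    name="Example Self-Esteem Scale (short form)",   # placeholder measure
    construct="global self-esteem",
    version="1.1-adolescent",
    language="en",
    target_population="adolescents aged 12-16",
    items=["On the whole, I am satisfied with myself."],
    validity_evidence=["https://doi.org/10.xxxx/placeholder"],  # placeholder DOI
    derived_from="Example Self-Esteem Scale v1.0",
)

# Serializing to a shared format such as JSON is what would enable the
# automated exchange of measure information between repositories.
print(json.dumps(asdict(record), indent=2))
```

A shared, serializable format of this kind would also let each adaptation point to its parent measure (the hypothetical derived_from field), making the proliferation discussed in the commentary traceable rather than opaque.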
Minor comment: the opening paragraph is entitled "Preface". This is a little bit unusual.

>> Authors' response: Thank you so much. We have deleted the title "Preface" in the revised version of the manuscript.
Reviewer #2 (Remarks to the Author):

Thank you for the invitation to review 'The proliferation of measures in psychology: Between frustration and need' for Communications Psychology.
The commentary is fine as is, but could be strengthened further. I provide comments below.
Note that we have written about similar measurement issues, sometimes based on the exact arguments and rationale the authors here provide. It feels like the authors could strengthen the work by referring to some of this. On the one hand, I understand this is unusual and often a red flag, so I leave it to the authors and editor to decide whether some of these references would be helpful. On the other, I assume I was invited to review this paper because we have done a lot of work on the topic, and it would be remiss not to point out potential work on the topic.
Broadly, the authors respond to a piece with the argument that there are 2 justifiable reasons for measure proliferation.
(1) Summarized: "Different measures for construct X may depend on the circumstance/situation, i.e. measures are context dependent (e.g. language, population)." I think you could rephrase the header; most readers will have a hard time understanding that. Some of the language seems unnecessary ("in virtually all cases idiosyncratic"); there are of course cross-cultural or cross-setting studies less idiosyncratic than others. We have written about context in some detail in our piece arguing against mandating a particular depression instrument by NIH/WELLCOME that seems relevant in this context: https://pubmed.ncbi.nlm.nih.gov/32986859/. Given that the authors of the original piece the authors comment on will agree that different language versions should exist, this does not count as proliferation, and I would look for stronger examples the authors may disagree with.
>> Authors' response: Thank you for this important comment.We have changed the heading as you advised.The heading in the revised version of the manuscript now reads "measures are context dependent".We have reworded the text pointed out by you by deleting "and in virtually all cases highly idiosyncratic".We have also added a half-sentence clarifying that our argument do not relate to different language versions and that this would not be contested in the SOBER guidelines.The sentence at the end of the paragraph now reads: "…and are not limited to the rather well documented domain of linguistic/cultural test adaptation."Thank you also for pointing us here and elsewhere in your review to important references and examples that are relevant to the content of our commentary.With regard to specific examples, we decided to cut down on them in general across the revised version of the manuscript because (1) Reviewer 1 did not agree with several of them and we realized that there is quite some potential for misunderstanding when using examples without extensively elaborating on their specific context and meaning and (2) we are aware that we need to keep the length of this commentary within reasonable limits (it is already overlength now).However, we did manage to add one of the references that you provided in the revised version of the manuscript.
(2) Summarized: "Different measures for construct X may tap into different constructs X." We talk about this in detail here: https://www.nature.com/articles/s44159-022-00050-2, discussing proliferation of depression measures and how they were developed for different contexts and may tap into different constructs. An issue I see with the line of reasoning the authors provide is that psych scales rarely start with a clear construct definition and then properly measure that - this is also the case when interrogating the scales in our work. They were developed for a purpose (e.g., prevalence rate estimation vs tracking treatment progress), but clear (in the best case formal) theories of constructs are absent, making it difficult to evaluate whether measures measure the purported constructs, because those weren't clearly defined. In that sense, one could disagree with the authors that proliferation "leads to more (higher quality)" scales. This is only the case if constructs are clearly defined, which isn't the case. From our overview on the topic (https://journals.sagepub.com/doi/10.1177/2515245920952393): "Barry, Chaney, Piazza-Gardner, and Chavarria (2014) reported that between 40% and 93% of measures used across seven journals in educational behavior lacked validity evidence, and Weidman, Steckler, and Tracy (2017) reported that among the 356 measurement instances coded in their review of emotion research, 69% included no reference to prior research or a systematic development process. In their review of the relation between technology use and well-being, Orben and Przybylski (2019) reported that researchers pick and choose within and between questionnaires, 'making the pre-specified constructs more of an accessory for publication than a guide for analyses' (p. 181)."
>> Authors' response: Thank you for mentioning these important points and for embedding our commentary within the broader scope of the discussion on valid measurement in psychological science.We wish we could embark on a more comprehensive appreciation of the arguments (and the literature) that you provide within our commentary but the nature of such a commentary does not allow for an indepth discussion of all of these important points.However, we tried to allude to some of them throughout the manuscript and in the revised version we also cite one of the references that you provided (keeping the strict limit of 10 references for commentaries in mind).
The authors then continue to look into consequences of less proliferation.
-"Unjustified variability in psychological measures should be discouragedbut it is difficult to ascertain when variability is indeed meaningless".I think there is decent literature on this-would anybody consider it reasonable to have 280 depression scales?https://www.tandfonline.com/doi/abs/10.1207/s15366359mea0403_1-"At the same time, programmatic stifling of such variability is likely to be detrimental if taken to the extreme and blindly imposed or enforced across situations".Agreed, and there is literature on this, including a number of arguments: https://pubmed.ncbi.nlm.nih.gov/32986859/>> Authors' response: Thank you for your valuable feedback and for pointing out these relevant references.Given that there is a strict limit of 10 references for commentaries, we refrained from citing them, but in the revised version of the manuscript we mention that there is additional and more detailed literature on these topics already available and that these topics have been discussed within the scientific community.