Nature | Toolbox

Stat-checking software stirs up psychology

Researchers debate whether using a program to automatically detect inconsistencies in papers improves the literature, or raises false alarms.


Illustration by The Project Twins

Michèle Nuijten and her colleagues found rampant inconsistencies when they unleashed statcheck on the psychological literature. The program scans articles for statistical results, redoes the calculations and checks that the numbers match. It went through 30,717 papers to identify 16,695 that tested hypotheses using statistics. In half of those, it found at least one potential error (M. B. Nuijten et al. Behav. Res. Methods 48, 1205–1226; 2016).

Nuijten did not alert the papers' authors. But this August, her co-author Chris Hartgerink, a fellow methodologist at Tilburg University in the Netherlands, moved the focus from the literature in general to specific papers. He set statcheck to work on more than 50,000 papers, and posted its reports on PubPeer, an online forum in which scientists often dispute papers. That has prompted a sometimes testy debate about how such tools should be used.


Hartgerink predicted that the posts would inform readers and authors about potential errors and “benefit the field more directly than just dumping a data set”. Not everyone agreed. On 20 October, the German Psychological Association warned that posting false findings of error could damage researchers' reputations. And later that month, a former president of the Association for Psychological Science in Washington DC decried the rise of “uncurated, unfiltered denigration” through blogs and social media, and implied that posts from statcheck-like programs could be seen as harassment.

Others foresee a positive change in the culture. Hartgerink and Nuijten have each received awards from organizations promoting open science. And in a PubPeer comment on the original statcheck paper, psychology researcher Nick Brown of the University of Groningen in the Netherlands wrote that science might benefit if researchers stopped assuming that posts on the forum indicated that there was “something naughty” in a paper, and instead thought, “There's a note on PubPeer, I will read it and evaluate it like a scientist.”

An automated tool makes researchers more likely to double-check their work, which is good for psychology, argues Simine Vazire, who studies self-perception at the University of California, Davis. “It will catch mistakes, but even more importantly it will make us more careful.”

That seems to appeal. Several thousand people have downloaded the free statcheck program, which works in the programming language R, or visited the web-based statcheck.io, which requires no programming knowledge. (Researchers who want to check selected results rather than whole papers can use online calculators such as ShinyApps.)

Technical check

Most psychology papers report statistical tests in a standardized format, with related parameters that can be checked for inconsistencies. Statcheck — which so far works only for papers in this format — identifies and inspects a few common tests that calculate P values, a measure of how likely results are to arise by chance if, for instance, no real difference exists between two groups (see 'What statcheck looks for'). Although statisticians have warned against it, a P value below 0.05 is often used as an arbitrary determiner of 'statistical significance', allowing results to be taken seriously and published.
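The core check is simple enough to sketch. The fragment below is an illustration of the idea only, not statcheck's actual code (statcheck is written in R): it extracts a reported z test from a sentence, recomputes the two-sided P value from the standard normal distribution, and accepts the reported value if rounding could explain the difference.

```python
import math
import re

# Matches reports such as "z = 2.20, p = .028" (a simplified version
# of the standardized format; the real tool handles t, F, r and
# chi-squared tests as well).
PATTERN = re.compile(r"z\s*=\s*(-?\d+\.?\d*),\s*p\s*[=<]\s*(\.\d+)")

def two_sided_p(z):
    """Two-sided P value for a z statistic under the standard normal."""
    return math.erfc(abs(z) / math.sqrt(2))

def check(sentence, tolerance=0.0005):
    """Return (recomputed_p, consistent) for the first z test found,
    or None if the sentence reports no z test in this format."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    z = float(match.group(1))
    reported_p = float(match.group(2))
    recomputed = two_sided_p(z)
    # Consistent if the reported value agrees with the recomputed one
    # to the reported precision (here, three decimal places).
    consistent = abs(recomputed - reported_p) < tolerance
    return recomputed, consistent
```

For z = 2.20, the recomputed two-sided P value is about 0.028, so a sentence reporting "z = 2.20, p = .028" passes, while "z = 2.20, p = .045" would be flagged as a potential inconsistency.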

Most of the errors that statcheck catches seem to be typos or copy-and-pasting mistakes, says Daniel Lakens, a cognitive psychologist at Eindhoven University of Technology in the Netherlands. After reading the statcheck paper, he decided to analyse the errors it reported that changed a result's statistical significance. He found three main categories. Often, a researcher had inserted an incorrect sign, such as P < 0.05 instead of P = 0.05. In other cases, the calculations were set up to detect only particular relationships, such as positive or negative correlation, which was not always made explicit. Optimistic rounding was also common: P values of 0.055 reported as P ≤ 0.05 made up 10% of detected errors that changed statistical significance, a rate that Lakens calls depressingly high.
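The one-tailed category is easy to reproduce numerically. In this illustrative Python sketch (the values are examples chosen here, not drawn from Lakens's analysis), the same z statistic yields two defensible P values depending on the direction of the test, so a checker that assumes two-tailed tests will flag a correctly reported one-tailed result.

```python
import math

def two_tailed_p(z):
    """Two-sided P value for a z statistic under the standard normal."""
    return math.erfc(abs(z) / math.sqrt(2))

def one_tailed_p(z):
    """One-sided P value: half the two-sided value."""
    return two_tailed_p(z) / 2

# For z = 1.80 the two-tailed P value is about 0.072 (not significant
# at 0.05), while the one-tailed value is about 0.036 (significant).
# Unless the paper states that the test was one-tailed, an automated
# checker recomputing the two-tailed value would raise a false alarm.
z = 1.80
```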

But statcheck itself makes errors, says Thomas Schmidt, an experimental psychologist at the University of Kaiserslautern in Germany, who wrote a critique of the program (T. Schmidt Preprint at http://arxiv.org/abs/1610.01010; 2016) after it flagged two of his papers. For example, it does not always recognize necessary statistical adjustments.

When statcheck does detect an error, it cannot distinguish whether it is the P value or a related parameter that is incorrect. Schmidt says that, across the two of his papers that it scanned, statcheck failed to detect 43 P values, checked 137 and noted 35 “potentially incorrect statistical results”. Of those, 2 reflected P-value errors that did not change significance, 3 reflected errors in other parameters but did not affect P values, and 30 were improperly flagged.

Nuijten admits that statcheck can sometimes misidentify tests and overlook adjusted P values, but she notes that, in her original paper, it found similar rates of error to manual checks.

Nuijten and Hartgerink have been working hard, mostly successfully, to keep conversations amiable. Nuijten has posted detailed explanations about how statcheck works, with smiley emoji and friendly exclamation marks. Hartgerink is updating PubPeer posts with an improved version of the software. Both note that anyone can add comments on PubPeer to explain statcheck's results, and that the posts state that results are not definitive. “The one thing I try to repeat over and over is that statcheck is automated software that will never be as accurate as a manual check,” says Nuijten.

Much of what statcheck flags up is trivial, but when authors do not respond, matters are left unresolved, says Elkan Akyürek, a psychologist at the University of Groningen. “Content-based discussion is getting a bit flooded.” Thought leaders such as neuropsychologist Dorothy Bishop of the University of Oxford, UK, worry that posts could distract from more serious discussions, or alienate people and make them less receptive to efforts to improve reproducibility. Heiko Hecht, a psychologist at Johannes Gutenberg University in Mainz, Germany, thinks it might have the opposite effect: “The program is still very immature, but in the long run could keep scientists honest.” Besides, he adds, if researchers made raw data available, anyone could check the results.


Some authors have expressed gratitude for a chance to correct mistakes, although several have said that they should have the chance to review posts before they are made public. At least three have responded on PubPeer to explain errors. Two of them told Nature that the errors were typos that did not affect P values and were too trivial to justify a formal correction. As for Vazire, she hopes that automated reports will help researchers to get used to post-publication commentary. “I think it will help desensitize us to criticism,” she says.

Editor's helper

In July this year, the journal Psychological Science began running statcheck on submissions that got favourable first reviews, and discussing flagged inconsistencies with the authors. “I thought there might be some blowback or resistance,” says editor-in-chief Stephen Lindsay. “Reaction has been almost non-existent.” Of the few dozen runs so far, none of the errors has been egregious, he says, although there have been at least two instances in which authors have reported a P value as 0.05 when it was 0.054.

Lindsay says that statcheck reports are too confusing to share with authors directly. (For example, the program flags potential errors with the word TRUE.) Nuijten says that an upcoming version will be much more comprehensible to non-programmers. Meanwhile, she says, her team has been talking to publishers Elsevier and PLOS about adopting the program at their titles. And statcheck may soon have company: a more-comprehensive commercial program called StatReviewer is under development by other researchers. It is designed to analyse papers from a variety of fields, not just to double-check calculations but also to ensure that reporting requirements are followed.

Lindsay hopes that statcheck's utility will fade over time as researchers stop manually entering statistical outcomes into their manuscripts; instead, the values would be directly inserted by the programs that produced them, and linked to their scripts. “The methodological leaders are using things like R Markdown,” he says.

As for Schmidt, he thinks that statcheck could be useful in manuscript preparation, but it is not for beginners. “The greatest risk during prepublication is that unsophisticated users overestimate the program, relying blindly on its output.” Lakens is sticking to a manual system: one author of a paper does the analyses, and another checks them. That can detect errors that statcheck will not, such as transposing results.

That approach makes sense to Nuijten. Her goal was never to fix statistical analysis. Statcheck is more like a standard spellchecker, she says: “a handy tool that sometimes says stupid things”. People laugh at the absurdities, but still use the tool to correct mistakes.

Journal name: Nature
Volume: 540
Pages: 151–152
DOI: 10.1038/540151a


Comments


Daniel Oberfeld
    In my view, this article underestimates the risks of using statcheck. As mentioned in the text, statcheck does not account for corrections for multiple testing (which is discussed in Nuijten et al. Behav. Res. Methods 48, 1205–1226; 2016). Probably even more serious, it does not account for corrected degrees-of-freedom (dfs) that MUST be used in some analyses, for example in within-subjects (repeated-measures) designs (e.g., Keselman, H. J., Algina, J., & Kowalchuk, R. K. (2001). The analysis of repeated measures designs: A review. British Journal of Mathematical & Statistical Psychology, 54, 1-20, http://onlinelibrary.wiley.com/doi/10.1348/000711001159357/full). The authors of the paper by Nuijten et al. seem not at all aware of this situation, because they mention the frequently used "Huynh-Feldt" correction for a deviation from a spherical variance-covariance matrix in a repeated-measures design as belonging to the class of "correction for multiple testing", which it is of course not: "However, when we automatically searched our sample of 30,717 articles, we found that only 96 articles reported the string "Bonferroni" (0.3 %) and nine articles reported the string "Huynh-Feldt" or "Huynh Feldt" (0.03 %). We conclude from this that corrections for multiple testing are rarely used and will not significantly distort conclusions in our study." (Nuijten et al., 2016, p. 1207). They also seem not aware that there are other frequently-used corrections for deviations from sphericity like Greenhouse-Geisser, or other df-corrections used in between-subjects design, like Satterthwaite (and, consequently, did not search for such terms in the papers they analyzed). 
For the example of my own papers flagged by statcheck on PubPeer, virtually all of the "errors" indicated by the tool arise because, as necessary in a within-subjects design, we used correction for the degrees of freedom when computing repeated-measures ANOVA using a univariate approach (e.g., Oberfeld, D., & Franke, T. (2013). Evaluating the robustness of repeated measures analyses: The case of small sample sizes and non-normal data. Behavior Research Methods, 45(3), 792-812. http://dx.doi.org/10.3758/s13428-012-0281-2). Also, we report the uncorrected dfs together with the F-value and also report the size of the Huynh-Feldt or Greenhouse-Geisser correction factor (epsilon-hat or epsilon-tilde). This is the recommended and established form, because only when the uncorrected dfs are reported it is easy to conduct a plausibility check to see if the selected test is suitable for the experimental design (i.e., number of subjects, number of levels of the within- and between-subjects factors). In its present form, statcheck reports "errors" when authors applied the proper df-correction, while it will show no "errors" for authors who for example did not correct for deviations from sphericity. In other words, if statcheck reports an error, this often indicates that the analyses were done properly. The very severe negative impact could be that authors who for example correctly applied the Huynh-Feldt correction, and then use statcheck to check their paper before submission, might think that they made a mistake and for this reason step back to the INCORRECT F-tests using uncorrected dfs. Thus, in the long run, the tool very likely has the effect of more rather than fewer statistical errors. All of these issues are NOT discussed in the paper by Nuijten et al. At present, the only utility of statcheck seems to detect rounding or copy-paste errors. 
While this is of course valuable information, in my experience as a reviewer the much more serious problem is that many authors apply the incorrect test (rather than simply reporting a numerically incorrect p-value). For example, ANOVAs for completely randomized designs are used when the data originate from a within-subjects design, corrections for the degrees of freedom are not used, etc. Some of these errors can be detected by comparing the (uncorrected) dfs to the sample size and the number of factor levels, but I'd expect these checks to be difficult with an automated tool.