In 2004, when two Spanish researchers checked the statistics in scientific papers from the BMJ and Nature, their results prompted a flurry of headlines and soul-searching among journal editors.

“Sloppy stats shame science” ran the headline in The Economist. “Statistical flaws revealed in top journals' papers” declared New Scientist. The revelation that more than a third of all Nature papers in 2001 contained statistical errors prompted the journal to introduce new checks on quality. At the BMJ (formerly the British Medical Journal), where one in four papers was found to contain flaws, editors ran an editorial discussing potential solutions.

Now it seems that another manuscript can be added to the pile of flawed publications: the paper by the Spanish researchers. According to an analysis published earlier this month, one of its two conclusions is invalid because it is based on inappropriate statistics. “Statistical tests should be used correctly,” notes the dry conclusion of the new paper.

The test in question was employed by ecologists Emili García-Berthou and Carles Alcaraz from the University of Girona to check whether numbers had been rounded correctly. If rounded without bias, the final digit of a number quoted to three or four significant figures should have an equal chance of being any digit between zero and nine. García-Berthou and Alcaraz found that fours and nines appeared less frequently than expected, perhaps because researchers have an unconscious preference for rounding up, turning fours into fives and nines into zeros (E. García-Berthou and C. Alcaraz BMC Med. Res. Methodol. 4, 13; 2004).
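As a concrete illustration, here is a minimal sketch in Python of such a final-digit uniformity check, run on simulated data with SciPy's chi-squared goodness-of-fit test. The simulation and all names are illustrative assumptions, not the authors' code or data:

```python
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(0)

# Simulate unbiased rounding: uniform values quoted to four decimal
# places, so the final quoted digit should be uniform over 0-9.
values = rng.uniform(0.0, 1.0, size=5000)
final_digits = np.array([int(f"{v:.4f}"[-1]) for v in values])

# Chi-squared goodness-of-fit test against a uniform distribution:
# chisquare() defaults to equal expected counts in every cell.
counts = np.bincount(final_digits, minlength=10)
stat, p = chisquare(counts)
print("digit counts:", counts)
print(f"chi2 = {stat:.2f}, p = {p:.3f}")  # large p: no evidence of bias
```

An unconscious preference for certain digits would show up as a deficit in some cells of `counts` and, with enough numbers, a small P value.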

But Monwhea Jeng, a physicist at Syracuse University in New York, points out that García-Berthou and Alcaraz employed the Kolmogorov–Smirnov test, which is used to evaluate whether the probability distributions underlying two sets of numbers differ. Jeng says this was a mistake, because the test assumes that the distributions involved are continuous; in the case of the final digit of a number, the distribution is discrete. He says a more appropriate analysis using a chi-squared test shows no evidence of rounding errors (M. Jeng BMC Med. Res. Methodol. 6, 45; 2006).
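The distinction is visible in a toy simulation, offered here as an illustrative sketch rather than a reproduction of either paper's analysis: final digits that are perfectly uniform still produce a vanishing P value under a Kolmogorov–Smirnov test naively applied against a continuous uniform distribution, whereas the chi-squared test raises no alarm.

```python
import numpy as np
from scipy.stats import chisquare, kstest

rng = np.random.default_rng(1)
digits = rng.integers(0, 10, size=5000)  # perfectly fair final digits

# Misapplied: a KS test of the step-function empirical CDF of the
# discrete digits against the *continuous* uniform CDF on [0, 10].
ks_stat, ks_p = kstest(digits, "uniform", args=(0, 10))
print(f"KS:   D = {ks_stat:.3f}, p = {ks_p:.2e}")  # spuriously tiny p

# Appropriate: chi-squared goodness-of-fit on the digit counts.
chi_stat, chi_p = chisquare(np.bincount(digits, minlength=10))
print(f"chi2: X2 = {chi_stat:.2f}, p = {chi_p:.3f}")  # unremarkable p
```

Because the empirical distribution of discrete digits is a staircase, its maximum distance from a continuous uniform never shrinks as the sample grows, so the misapplied test rejects even perfectly fair data.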

Two independent statisticians contacted by Nature say Jeng's analysis seems to be correct, and a referee report, posted on the journal's website, indicates that use of a different statistical package confirmed Jeng's conclusion. But García-Berthou insists his use of the test was appropriate, and says he will explain why in a letter to the journal. The argument could be viewed as no more than amusingly ironic, not least because Jeng's analysis does not affect García-Berthou and Alcaraz's second and perhaps more important finding: that many P values in Nature and BMJ papers were wrongly calculated. But the incident does illustrate the difficulty of assessing whether statistical tests have been properly used, especially given that scientists well trained in statistics do not always agree.

“If a journal doesn't have enough expertise it's a real problem,” says Steve Goodman, a medical statistician and editor at the Annals of Internal Medicine. He says he rarely sees papers that are free of flaws. “I view most of the literature as done wrong.”

A study published this July, for example, examines the use of P values in paper abstracts (P. C. Gøtzsche BMJ 333, 231–234; 2006). The paper concludes: “Significant results in abstracts are common but should generally be disbelieved.”