Automated screening

ArXiv screens spot fake papers

A Correction to this article was published on 30 April 2014

Unlike the computer-generated nonsense papers in some peer-reviewed subscription services (see Nature http://doi.org/r3n; 2014), the 500 or so preprints received daily by the automated repository arXiv are not pre-screened by humans. But sometimes automated assessment can be better than human diligence at enforcing standards.

The automated screens for outliers in arXiv include analysis of the probability distributions of words and their combinations, ensuring that they fall into patterns that are consistent with existing subject classes. This serves as a check of the subject categorizations provided by submitters, and helps to detect non-research content.
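As a rough illustration of checking a submitter's subject class against word statistics, the sketch below fits a smoothed unigram model per class and scores a new text against each. It is a minimal toy under stated assumptions, not arXiv's actual screen: the class labels, the two-sentence training corpora and the helper names (`train_unigram`, `avg_log_prob`, `best_class`) are all hypothetical, and a real system would also model word combinations and train on far more text.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, keep purely alphabetic tokens."""
    return [w for w in text.lower().split() if w.isalpha()]


def train_unigram(texts):
    """Count word occurrences over all texts of one subject class."""
    counts = Counter()
    for t in texts:
        counts.update(tokenize(t))
    return counts


def avg_log_prob(text, counts, vocab_size):
    """Mean per-word log-probability under an add-one-smoothed unigram model."""
    tokens = tokenize(text)
    if not tokens:
        return float("-inf")
    total = sum(counts.values())
    return sum(
        math.log((counts[w] + 1) / (total + vocab_size)) for w in tokens
    ) / len(tokens)


def best_class(text, models):
    """Return the best-scoring class name and the score for every class."""
    vocab = set()
    for counts in models.values():
        vocab.update(counts)
    vocab_size = len(vocab) + 1  # one extra slot for unseen words
    scores = {
        name: avg_log_prob(text, counts, vocab_size)
        for name, counts in models.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical toy corpora, invented purely for illustration.
models = {
    "hep-th": train_unigram([
        "string theory gauge field duality",
        "black hole entropy gauge theory",
    ]),
    "cs.LG": train_unigram([
        "neural network training gradient descent",
        "learning rate gradient network model",
    ]),
}

label, scores = best_class("gauge theory of black hole", models)
print(label, scores)
```

A text whose best-scoring class differs from the one its submitter chose, or which scores poorly under every class, would be a candidate for closer inspection.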

Fake papers generated by SCIgen software, for example, have a 'native dialect' that can be picked up by simple stylometric analysis (see J. N. G. Binongo Chance 16, 9–17; 2003). The most frequent words used in English text (stop words such as 'the', 'of', 'and') encode stylistic features that are independent of content. On average, these words follow a power-law distribution that is evident in even relatively small amounts of text; significant deviations signal outliers.
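The stop-word idea can be sketched in a few lines. The code below computes the relative frequency of a short, fixed list of stop words, estimates the power-law slope of a ranked frequency list by least squares on log-log axes, and measures how far one profile lies from another. The ten-word list, the function names and the use of Euclidean distance are assumptions made for illustration; the source does not specify which words, fit or distance measure the arXiv screens use.

```python
import math
from collections import Counter

# Hypothetical short list; a real analysis would use many more stop words.
STOP_WORDS = ["the", "of", "and", "to", "a", "in", "is", "that", "for", "it"]


def stop_word_profile(text):
    """Relative frequency of each stop word, in STOP_WORDS order."""
    tokens = [t.strip(".,;:()'\"") for t in text.lower().split()]
    counts = Counter(tokens)
    n = max(len(tokens), 1)  # avoid division by zero on empty text
    return [counts[w] / n for w in STOP_WORDS]


def zipf_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank)."""
    ranked = sorted((f for f in frequencies if f > 0), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two non-zero frequencies")
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(f) for f in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def profile_distance(p, q):
    """Euclidean distance between two stop-word profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


profile = stop_word_profile("The cat and the dog.")
slope = zipf_slope([60, 30, 20, 15, 12])  # counts proportional to 1/rank
print(profile, slope)
```

In this framing, a document would be flagged when its fitted slope or its distance from a reference profile falls well outside the range seen in ordinary submissions; where to set that threshold is an empirical choice not addressed here.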

The effect can be seen in principal-component analysis plots (see 'Counterfeit clusters'). Computer-generated articles form tight clusters that are well separated from human-authored articles.
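To make the clustering concrete, here is a minimal sketch of projecting stop-word profiles onto their first principal component, using power iteration on the covariance matrix so that only the standard library is needed. The six two-dimensional vectors are made-up numbers chosen to form two groups, not measurements from real papers, and the function names are hypothetical; a plot like the one referred to above would use more dimensions and at least two components.

```python
import math


def center(rows):
    """Subtract the per-column mean from every row."""
    d = len(rows[0])
    means = [sum(r[j] for r in rows) / len(rows) for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def first_component(cov, iterations=200):
    """Leading eigenvector of cov by power iteration (unit length)."""
    d = len(cov)
    # Fixed start vector; it must not be orthogonal to the leading eigenvector.
    v = [1.0 / (i + 1) for i in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("data has no variance along the start direction")
        v = [x / norm for x in w]
    return v


def project(centered, v):
    """Score of each centered row along direction v."""
    return [sum(a * b for a, b in zip(r, v)) for r in centered]


# Invented toy profiles (two stop-word frequencies per document).
human = [[0.06, 0.03], [0.07, 0.03], [0.06, 0.04]]
generated = [[0.10, 0.01], [0.11, 0.01], [0.10, 0.02]]

centered = center(human + generated)
pc1 = first_component(covariance(centered))
scores = project(centered, pc1)
human_scores, generated_scores = scores[:3], scores[3:]
print(human_scores, generated_scores)
```

With these toy inputs the two groups land on opposite sides of zero along the first component, which is the one-dimensional analogue of the separation described in the text. The sign of the component itself is arbitrary.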


Correspondence to Paul Ginsparg.

About this article

Cite this article

Ginsparg, P. ArXiv screens spot fake papers. Nature 508, 44 (2014). https://doi.org/10.1038/508044a