Unlike the computer-generated nonsense papers that made it into some peer-reviewed subscription journals (see Nature http://doi.org/r3n; 2014), the 500 or so preprints received daily by the automated repository arXiv are not pre-screened by humans. Yet automated assessment can sometimes be better than human diligence at enforcing standards.

arXiv's automated screens for outliers include analysis of the probability distributions of words and their combinations, checking that these distributions fall into patterns consistent with existing subject classes. This serves as a check on the subject categorizations provided by submitters, and helps to detect non-research content.
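As an illustration only (arXiv's actual screening code is not public), such a check can be sketched as comparing a submission's word distribution against per-class baseline distributions, flagging submissions that fit a different class better, or no class at all. The class baselines, threshold and function names below are hypothetical.

```python
# Hypothetical sketch of a word-distribution consistency check.
# Baselines and threshold are illustrative, not arXiv's real pipeline.
from collections import Counter
import math

def unigram_dist(text):
    """Normalized word-frequency distribution of a document."""
    words = text.lower().split()
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def kl_divergence(p, q, floor=1e-8):
    """KL(p || q), with a small floor for words unseen in the baseline."""
    return sum(pw * math.log(pw / q.get(w, floor)) for w, pw in p.items())

def check_category(text, submitted_class, class_baselines, threshold=2.0):
    """Flag a submission whose word distribution fits another class
    better, or fits no class well at all (possible non-research text)."""
    p = unigram_dist(text)
    scores = {c: kl_divergence(p, q) for c, q in class_baselines.items()}
    best = min(scores, key=scores.get)
    if scores[submitted_class] > threshold:
        return f"outlier: poor fit to any class (closest: {best})"
    if best != submitted_class:
        return f"possible misclassification: looks like {best}"
    return "consistent"
```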

Fake papers generated by SCIgen software, for example, have a 'native dialect' that can be picked up by simple stylometric analysis (see J. N. G. Binongo Chance 16, 9–17; 2003). The most frequent words in English text (stop words such as 'the', 'of' and 'and') encode stylistic features that are independent of content. The frequencies of these words follow a power-law distribution that is evident even in relatively small amounts of text; significant deviations from it signal outliers.
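A minimal sketch of such a stylometric fingerprint, assuming a short illustrative stop-word list and a reference profile estimated from genuine articles (both are placeholders, not the screens arXiv actually uses):

```python
# Stop-word stylometry in the spirit of Binongo's analysis: represent
# each document by relative frequencies of common function words,
# which carry style rather than content. Word list is illustrative.
from collections import Counter

STOP_WORDS = ["the", "of", "and", "to", "in", "a", "is", "that",
              "for", "it", "as", "with", "was", "on", "be", "by"]

def stopword_profile(text):
    """Frequency of each stop word per 1,000 tokens."""
    words = text.lower().split()
    counts = Counter(words)
    n = max(len(words), 1)
    return [1000.0 * counts[w] / n for w in STOP_WORDS]

def deviation(profile, reference):
    """Sum of squared deviations from a reference (human-text) profile;
    large values signal stylistic outliers such as SCIgen output."""
    return sum((p - r) ** 2 for p, r in zip(profile, reference))
```

Because the profile ignores content words entirely, it is cheap to compute and largely insensitive to a paper's topic.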

The effect can be seen in principal-component analysis plots (see 'Counterfeit clusters'). Computer-generated articles form tight clusters that are well separated from human-authored articles.
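For readers who want to reproduce the effect, a hedged sketch using scikit-learn's PCA on the stop-word profiles above follows; `human_papers` and `scigen_papers` are placeholder corpora, not real data.

```python
# Project each document's stop-word profile onto its first two
# principal components; generated and genuine articles typically
# land in separate clusters. Requires numpy and scikit-learn.
import numpy as np
from collections import Counter
from sklearn.decomposition import PCA

STOP_WORDS = ["the", "of", "and", "to", "in", "a", "is", "that",
              "for", "it", "as", "with", "was", "on", "be", "by"]

def stopword_profile(text):
    """Stop-word frequencies per 1,000 tokens (as in the earlier sketch)."""
    words = text.lower().split()
    counts = Counter(words)
    n = max(len(words), 1)
    return [1000.0 * counts[w] / n for w in STOP_WORDS]

def pca_coordinates(documents):
    """2-D PCA coordinates of the documents' stop-word profiles."""
    X = np.array([stopword_profile(doc) for doc in documents])
    return PCA(n_components=2).fit_transform(X)

# Plotting pca_coordinates(human_papers + scigen_papers), with the two
# groups coloured separately, reproduces the pattern described above:
# computer-generated articles form a tight, well-separated cluster.
```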