Automated screening: ArXiv screens spot fake papers

Unlike some peer-reviewed subscription services, which have published computer-generated nonsense papers (see Nature; 2014), the automated repository arXiv does not have humans pre-screen the 500 or so preprints it receives daily. But automated assessment can sometimes be better than human diligence at enforcing standards.

The automated screens for outliers in arXiv include analysis of the probability distributions of words and their combinations, ensuring that they fall into patterns that are consistent with existing subject classes. This serves as a check of the subject categorizations provided by submitters, and helps to detect non-research content.

Fake papers generated by SCIgen software, for example, have a 'native dialect' that can be picked up by simple stylometric analysis (see J. N. G. Binongo Chance 16, 9–17; 2003). The most frequent words used in English text (stop words such as 'the', 'of', 'and') encode stylistic features that are independent of content. On average, these words follow a power-law distribution that is evident in even relatively small amounts of text; significant deviations signal outliers.
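The stop-word fingerprint idea can be sketched in a few lines. This is an illustrative toy, not arXiv's actual screen: the ten-word stop list and the Euclidean distance measure are assumptions chosen for brevity.

```python
from collections import Counter
import re

# A small illustrative stop list; a real screen would use a longer one.
STOP_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is", "for", "it"]

def stopword_profile(text):
    """Relative frequencies of common stop words: a crude stylometric fingerprint."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w in STOP_WORDS)
    total = sum(counts.values()) or 1
    return [counts[w] / total for w in STOP_WORDS]

def profile_distance(p, q):
    """Euclidean distance between two profiles; large values flag outliers."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
```

In this sketch, each submission's profile would be compared against a reference profile built from a large corpus of human-authored text in the same subject class; an unusually large distance marks the submission for attention.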

The effect can be seen in principal-component analysis plots (see 'Counterfeit clusters'). Computer-generated articles form tight clusters that are well separated from human-authored articles.
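The clustering effect can be illustrated with a small principal-component projection of stop-word frequency vectors. The numbers below are invented for illustration only; an actual analysis would use profiles measured from real human-authored and SCIgen corpora.

```python
import numpy as np

def pca_2d(X):
    """Project rows of X onto their first two principal components."""
    Xc = X - X.mean(axis=0)          # centre each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T             # coordinates in the top-2 PC plane

# Toy stop-word frequency vectors: three 'human' texts and three
# 'generated' texts with a systematically different profile.
human = np.array([[0.60, 0.25, 0.15],
                  [0.58, 0.27, 0.15],
                  [0.62, 0.24, 0.14]])
fake = np.array([[0.33, 0.33, 0.34],
                 [0.35, 0.31, 0.34],
                 [0.34, 0.32, 0.34]])
coords = pca_2d(np.vstack([human, fake]))
```

Plotting `coords` would show the two groups as separated clusters along the first principal component, mirroring the 'counterfeit clusters' effect described above.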

Author information


Paul Ginsparg, Cornell University, Ithaca, New York, USA.


  1. Brian Josephson said:

    In response to Ginsparg's letter, I submitted the following comment to Nature Correspondence:


    Ginsparg (ArXiv screens spot fake papers, Nature 508, 44; 2014) has extolled the benefits of automated assessment of papers uploaded to an archive, as opposed to 'human diligence'. As with telephone helplines, automated processing can be problematic. An automated process focussed on the deviation of a paper from norms will have difficulty distinguishing between submissions that have unusual characteristics because they are bad, and ones that are unusual because they involve a novel approach. Submissions of both types seem to be treated in a similar way, by arXiv's robot and by volunteers giving the 'cursory glance' to new submissions previously described (ArXiv at 20, Nature 476, 145–147; 2011).

    Through the use of such mechanical processes, using the wrong word in a paper can lead to its progress being seriously impeded rather than quickly becoming public. There is a distinct similarity between arXiv's activities and the way security agencies go about their business processing the data they collect, in the latter case looking for patterns indicative of terrorists. ArXiv's own 'dangerous items' (as has been revealed by someone familiar with the details) are much influenced by 'reader complaints'; however, many important ideas were equally the subject of 'reader complaints' when first proposed. Terrorists one has to try to stop, but few scientists have had fatal encounters with papers whose subject matter they have found disagreeable.

    ArXiv is not some kind of journal conferring approval on accepted papers, and keeping fussy readers happy should not take priority over its primary purpose, facilitating communication among researchers. It should accordingly cease using the aggressive review processes currently employed.


    Asked by Nature to respond, Ginsparg replied that the filter he was talking about only looked for patterns in very basic words. This ignores the fact that submissions are regularly screened to see if they contain 'naughty words', and if they do then submissions that would normally get through routinely are singled out for special attention and often blocked for doubtful reasons. Examples:

    • Mark Davidson reports that his paper giving a theoretical explanation for Low Energy Nuclear Reactions or LENR (a 'naughty word' as far as arXiv is concerned, it seems) was put on indefinite hold, eventually being allowed in only after it was published, a constraint not normally applied.
    • Mats Lewan reports that when the paper arXiv:1305.3913 on a similar subject (experimental investigation of the claims for the Rossi reactor) was submitted the moderators (as revealed in accidentally leaked emails) tried hard to find reasons for not allowing it, but had to admit to failure.
    • I recently obtained the rights to conference proceedings I had co-edited. Cambridge University Library's depository had no problems with this, but it has been under review by arXiv's moderators for a very long time, I suspect because just one of the chapters is on a 'naughty subject', experiments in psychokinesis.

    In his response to my letter, Ginsparg referred to 'other forms of nonsense or non-research content'. It is hard to see how the above fit into either of these categories.

    How much does this kind of censorship matter? In this connection, I have speculated elsewhere as to what might have happened to Einstein or C. N. Yang had they tried to submit preprints to the archive:

    It is just an ordinary day at the headquarters of the physics preprint archive. The operators are going through their daily routine and are discussing what to do about recent emails:

    Some "reader complaints" have come in regarding preprints posted to the archive by Drs. Einstein and Yang. Dr. Einstein, who is not even an academic, claims to have shown in his preprint that mass and energy are equivalent, while Professor Yang is suggesting, on the basis of an argument I find completely unconvincing, that parity is not conserved in weak interactions. What action shall I take?

    Abject nonsense! Just call up their records and set their 'barred' flags to TRUE.

