Computer-generated nonsense papers have been found in some peer-reviewed subscription services (see Nature http://doi.org/r3n; 2014). By contrast, the 500 or so preprints received daily by the automated repository arXiv are not pre-screened by humans. But sometimes automated assessment can be better than human diligence at enforcing standards.
The automated screens for outliers in arXiv include analysis of the probability distributions of words and their combinations, ensuring that they fall into patterns that are consistent with existing subject classes. This serves as a check of the subject categorizations provided by submitters, and helps to detect non-research content.
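As a rough illustration of checking a submission's word distribution against subject classes, the sketch below scores a text under add-one-smoothed unigram models, one per class. It is not arXiv's actual screen: the function names, the toy training texts and the choice of a unigram model are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train_unigram(texts):
    """Count word occurrences over the example texts of one subject class."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


def log_likelihood(text, counts, vocab_size):
    """Average per-token log-probability under an add-one-smoothed unigram model."""
    tokens = tokenize(text)
    if not tokens:
        return float("-inf")
    total = sum(counts.values())
    return sum(
        math.log((counts[t] + 1) / (total + vocab_size)) for t in tokens
    ) / len(tokens)


def best_class(text, models):
    """Return the best-scoring class name and the score for every class."""
    vocab = set().union(*models.values())  # all words seen in any class
    scores = {
        name: log_likelihood(text, counts, len(vocab))
        for name, counts in models.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical toy corpora; real models would be built from many articles.
models = {
    "physics": train_unigram(["quantum field theory of the electron",
                              "the field of particle physics"]),
    "biology": train_unigram(["gene expression in the cell",
                              "the cell and the protein"]),
}
label, scores = best_class("electron field theory", models)
```

Under these assumptions, a mismatch between `label` and the class chosen by the submitter would prompt a closer look at the categorization, and a text that scores poorly under every class would be a candidate outlier.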
Fake papers generated by SCIgen software, for example, have a 'native dialect' that can be picked up by simple stylometric analysis (see J. N. G. Binongo Chance 16, 9–17; 2003). The most frequent words used in English text (stop words such as 'the', 'of', 'and') encode stylistic features that are independent of content. On average, these words follow a power-law distribution that is evident in even relatively small amounts of text; significant deviations signal outliers.
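The stop-word idea can be sketched in a few lines. The example below is a minimal illustration, not the analysis used by arXiv or by Binongo: the short stop-word list, the Euclidean distance between profiles and the least-squares fit on a log-log scale are all assumptions chosen for brevity.

```python
import math
import re
from collections import Counter

# Illustrative stop-word list (an assumption); a real analysis would use more.
STOP_WORDS = ("the", "of", "and", "to", "a", "in", "is", "that")


def stop_word_profile(text, stop_words=STOP_WORDS):
    """Relative frequency of each stop word among all tokens in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[w] / total for w in stop_words]


def profile_distance(p, q):
    """Euclidean distance between two stop-word profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def rank_frequency_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    For a power law, frequency ~ rank**slope, so the points lie on a
    straight line in log-log coordinates.
    """
    freqs = sorted((f for f in frequencies if f > 0), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two non-zero frequencies")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

A text whose profile lies far from a reference profile built from known human-written articles, or whose fitted slope departs markedly from the reference slope, would be treated as a candidate outlier. Any threshold would have to be calibrated on real data.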
The effect can be seen in principal-component analysis plots (see 'Counterfeit clusters'). Computer-generated articles form tight clusters that are well separated from human-authored articles.
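To show how such clusters can emerge, the sketch below projects per-document feature vectors (for example, stop-word frequencies) onto their leading principal components. It uses only the standard library, with power iteration and deflation; in practice a numerical library would normally do this step. The feature values are hypothetical and are not data from arXiv.

```python
import math


def center(rows):
    """Subtract the per-column mean from every row."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(rows):
    """Sample covariance matrix of rows that are already centred."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_eigenvector(matrix, iterations=500):
    """Dominant eigenvector of a symmetric matrix by power iteration."""
    d = len(matrix)
    v = [1.0 / (i + 1) for i in range(d)]  # fixed, non-uniform start vector
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # no variance left to explain
            break
        v = [x / norm for x in w]
    return v


def principal_components(rows, k=2):
    """Project rows onto their first k principal axes."""
    centred = center(rows)
    cov = covariance(centred)
    d = len(cov)
    axes = []
    for _ in range(k):
        v = top_eigenvector(cov)
        eigenvalue = sum(v[i] * cov[i][j] * v[j]
                         for i in range(d) for j in range(d))
        axes.append(v)
        # Deflate: remove the variance already explained by this axis.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    return [[sum(x * a for x, a in zip(row, axis)) for axis in axes]
            for row in centred]


# Hypothetical two-feature profiles: two documents of one kind, two of another.
features = [[0.10, 0.05], [0.11, 0.05], [0.02, 0.09], [0.02, 0.10]]
scores = principal_components(features, k=1)
```

In this toy input the first two documents and the last two fall on opposite sides of zero along the first component, which is the kind of separation a scatter plot of the leading components would show.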