How often do researchers plagiarize each other's work? The question has previously been almost impossible to answer, as no large-scale survey of the practice had been conducted. But a computer scientist has now examined more than a quarter of a million documents from a physics preprint server. The results contain the comforting news that blatant deception is rare, but suggest that minor acts of misconduct may be more common than was previously thought.

Student plagiarism can often be checked using specialist databases of essays available for sale online, but plagiarism in published research is harder to police. Many publishers don't allow search engines to index the full text of their papers, so it's impossible to run electronic checks on new studies. Those small surveys that have been done revealed little evidence of plagiarism, but suggested that duplicate publications — in which a significant amount of an existing paper by the same author is reused without providing a reference to the original — could make up about 10% of the literature in some fields (see Nature 435, 258–259; 2005).

Firmer numbers can now be put on those estimates, thanks to the work of Daria Sorokina, a PhD student at Cornell University in Ithaca, New York. Sorokina's software trawled more than 280,000 entries in arXiv, a database of mainly physics, maths and computer-science preprints maintained at Cornell. Her code divides documents into seven-word chunks and looks for pairs of papers that share a suspicious number of such chunks (common phrases such as “this work was supported in part by” are excluded). The result is a list of possible plagiarisms or, if the documents share a common author, duplicate publications.
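In outline this is a standard 'shingling' comparison. The Python sketch below is a minimal illustration of that idea, not Sorokina's code: the seven-word window and the excluded stock phrase come from the description above, but the function names, the overlap threshold and the toy documents are invented for the example.

```python
from itertools import combinations

WINDOW = 7  # seven-word chunks, as described in the article

# Boilerplate chunks to ignore; "this work was supported in part by"
# happens to be exactly seven words, so it can sit directly in this set.
COMMON_CHUNKS = {tuple("this work was supported in part by".split())}


def shingles(text):
    """Return the set of seven-word chunks in a document, minus stock phrases."""
    words = text.lower().split()
    chunks = {tuple(words[i:i + WINDOW]) for i in range(len(words) - WINDOW + 1)}
    return chunks - COMMON_CHUNKS


def suspicious_pairs(docs, threshold=20):
    """Flag pairs of documents that share at least `threshold` chunks.

    `docs` maps document IDs to full text. The threshold is an assumed
    cut-off for illustration, not the value used in the arXiv survey.
    """
    chunk_sets = {doc_id: shingles(text) for doc_id, text in docs.items()}
    flagged = []
    for a, b in combinations(chunk_sets, 2):
        shared = len(chunk_sets[a] & chunk_sets[b])
        if shared >= threshold:
            flagged.append((a, b, shared))
    return flagged


# Toy example: paper_b reuses a passage from paper_a; paper_c is unrelated.
papers = {
    "paper_a": "we measure the decay rate of the excited state with a pulsed laser " * 4,
    "paper_b": "we measure the decay rate of the excited state with a pulsed laser " * 4
               + "and we also report several entirely new measurements",
    "paper_c": "an unrelated study of galaxy rotation curves and dark matter halo profiles " * 4,
}
print(suspicious_pairs(papers, threshold=5))
```

An all-pairs comparison like this would be far too slow for 280,000 documents; a production system would presumably build an inverted index from each chunk to the documents containing it, so that only papers already sharing at least one chunk are ever compared directly.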

The search turned up 677 examples of possible plagiarism, of which Sorokina and her colleagues took a close look at 20. Only four turned out to be innocent, such as articles that quoted text from a third scientist. Of the rest, three were judged to be 'serious plagiarism', in which one article was essentially a copy of another; in the remainder, sections such as the introduction or the discussion of related work had been copied without appropriate references being given. If those proportions hold across all the flagged cases, just 0.2% of arXiv documents contain plagiarism. The results will be presented at the IEEE International Conference on Data Mining, to be held on 18–22 December in Hong Kong.

Results for duplicates are potentially more alarming but harder to assess. Sorokina identified 30,316 pairs where one was largely a copy of the other — more than 10% of the database. But arXiv differs from journals in that researchers submit conference proceedings as well as the journal papers that are derived from them. Paul Ginsparg, a Cornell physicist who worked with Sorokina on the survey, says the “vast majority” of duplicates found are of this type, but adds that he was surprised at the number of student theses that included material copied verbatim from other sources.


Despite the leap forward provided by the arXiv survey, many issues remain unresolved. The survey picks up only cases in which the plagiarized source is also present in the arXiv database (although in many fields arXiv has near-complete coverage). And the software cannot detect 'intelligent plagiarism', in which material copied from another author is reworded.

Researchers may also behave differently when submitting to arXiv compared with peer-reviewed journals, and different rates may exist for biologists, who rarely use preprint servers. Plagiarism in biology could in principle be studied using PubMed Central, an archive of journal papers maintained by the US National Institutes of Health. Ginsparg says he has discussed this with PubMed's staff, but such a survey would currently be of limited use, as only a small fraction of papers are placed in the database after publication.

Larry Claxton, a toxicologist at the US Environmental Protection Agency in North Carolina, who has studied plagiarism in the life sciences, says he would like to see a more detailed examination of the duplicates in the arXiv study before interpreting the results. But despite the limitations, he says, “this may be the most accurate global estimate that we have for plagiarism in the scientific literature”.

Ginsparg says he would now like arXiv to scan all new entries using Sorokina's software and alert the author in the case of suspicious overlap. The researcher would then have the option of rewriting the paper or, if they felt the overlap was justified, submitting it as usual. Ginsparg says the system could be in place by the middle of next year. Preliminary discussions have also taken place with CrossRef, a publishers' group, about whether journals could work together to implement a similar scheme.