News
Published: 01 November 2006

Preprint analysis quantifies scientific plagiarism

Jim Giles

Nature volume 444, pages 524–525 (2006)

Physics papers reveal few serious breaches but some duplication.

How often do researchers plagiarize each other's work? The question has previously been almost impossible to answer, as no large-scale survey of the practice had been conducted. But a computer scientist has now examined more than a quarter of a million documents from a physics preprint server. The results contain the comforting news that blatant deception is rare, but suggest that minor acts of misconduct may be more common than was previously thought.

Student plagiarism can often be checked using specialist databases of essays available for sale online, but plagiarism in published research is harder to police. Many publishers don't allow search engines to index the full text of their papers, so it's impossible to run electronic checks on new studies. Those small surveys that have been done revealed little evidence of plagiarism, but suggested that duplicate publications — in which a significant amount of an existing paper by the same author is reused without providing a reference to the original — could make up about 10% of the literature in some fields (see Nature 435, 258–259; 2005).

Firmer numbers can now be put on those estimates, thanks to the work of Daria Sorokina, a PhD student at Cornell University in Ithaca, New York. Sorokina's software trawled more than 280,000 entries in arXiv, a database of mainly physics, maths and computer-science preprints maintained at Cornell. Her code divides documents into seven-word chunks and looks for pairs of papers that share a suspicious number of such chunks (common phrases such as “this work was supported in part by” are excluded). The result is a list of possible plagiarisms or, if the documents share a common author, duplicate publications.
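The chunk-matching idea described above can be illustrated with a minimal sketch. This is not Sorokina's code: the function names, the use of overlapping chunks, the tokenization and the `min_shared` threshold are all assumptions made for illustration, since the article gives only the seven-word chunk length and the exclusion of common phrases.

```python
import re
from collections import defaultdict
from itertools import combinations

CHUNK_WORDS = 7  # chunk length reported in the article


def chunks(text, n=CHUNK_WORDS):
    """Return the set of n-word chunks in text (overlapping chunks: an assumption)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def suspicious_pairs(docs, common_phrases=(), min_shared=2):
    """Map each document pair to its number of shared chunks.

    docs: dict of document id -> text.
    common_phrases: boilerplate phrases whose chunks are ignored.
    min_shared: hypothetical threshold; the real cutoff is not given in the article.
    """
    excluded = set()
    for phrase in common_phrases:
        excluded |= chunks(phrase)

    # Inverted index: chunk -> ids of the documents that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for chunk in chunks(text) - excluded:
            index[chunk].add(doc_id)

    # Count shared chunks for every pair of documents that co-occur on a chunk.
    shared = defaultdict(int)
    for ids in index.values():
        for pair in combinations(sorted(ids), 2):
            shared[pair] += 1

    return {pair: count for pair, count in shared.items() if count >= min_shared}
```

Building an inverted index from chunk to documents avoids comparing every pair of papers directly, which would be impractical across hundreds of thousands of entries. Whether a flagged pair counts as possible plagiarism or as a duplicate publication would then depend on a separate check for shared authors, which this sketch leaves out.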

The search turned up 677 examples of possible plagiarism, of which Sorokina and her colleagues took a close look at 20. Only four were innocent mistakes, such as articles that quoted text from a third scientist. Of the remaining 16, three were judged to be 'serious plagiarism', in which one article was essentially a copy of another; in the other 13, parts of the paper such as the introduction or related-work sections had been copied without appropriate references. If the sample is representative of all 677 cases, then just 0.2% of arXiv documents contain plagiarism. The results will be presented at the IEEE International Conference on Data Mining, to be held on 18–22 December in Hong Kong.
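One plausible way to arrive at the 0.2% figure is to extrapolate the confirmed fraction of the 20-case sample to all 677 flagged cases and divide by the size of the corpus. The sketch below assumes that is how the number was derived, treats "more than 280,000" as exactly 280,000, and counts cases rather than individual documents; the function name is hypothetical.

```python
def extrapolated_rate(flagged, sampled, innocent, corpus_size):
    """Estimate the share of the corpus involved in confirmed plagiarism.

    Assumes the hand-checked sample is representative of every flagged case.
    """
    confirmed_fraction = (sampled - innocent) / sampled
    estimated_cases = flagged * confirmed_fraction
    return estimated_cases / corpus_size


# Figures as reported in the article; 280_000 is a lower bound on the corpus size.
rate = extrapolated_rate(flagged=677, sampled=20, innocent=4, corpus_size=280_000)
print(f"{rate:.2%}")
```

With a sample of only 20 cases, the estimate carries wide uncertainty, so it is best read as an order of magnitude.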

Results for duplicates are potentially more alarming but harder to assess. Sorokina identified 30,316 pairs where one was largely a copy of the other — more than 10% of the database. But arXiv differs from journals in that researchers submit conference proceedings as well as the journal papers that are derived from them. Paul Ginsparg, a Cornell physicist who worked with Sorokina on the survey, says the “vast majority” of duplicates found are of this type, but adds that he was surprised at the number of student theses that included material copied verbatim from other sources.

Despite the leap forward provided by the arXiv survey, many issues remain unresolved. The survey picks up only cases where the source that has been plagiarized is also present on the arXiv database (although in many fields arXiv has near-complete coverage). And the software is unable to pick up 'intelligent plagiarism', where material copied from another author is reworded.

Researchers may also behave differently when submitting to arXiv compared with peer-reviewed journals, and different rates may exist for biologists, who rarely use preprint servers. Plagiarism in biology could in principle be studied using PubMed Central, an archive of journal papers maintained by the US National Institutes of Health. Ginsparg says he has discussed this with PubMed's staff, but such a survey would currently be of limited use, as only a small fraction of papers are placed in the database after publication.

Larry Claxton, a toxicologist at the US Environmental Protection Agency in North Carolina, who has studied plagiarism in the life sciences, says he would like to see a more detailed examination of the duplicates in the arXiv study before interpreting the results. But despite the limitations, he says, “this may be the most accurate global estimate that we have for plagiarism in the scientific literature”.

Ginsparg says he would now like arXiv to scan all new entries using Sorokina's software and alert the author in the case of suspicious overlap. The researcher would then have the option of rewriting the paper or, if they felt the overlap was justified, submitting as usual. Ginsparg says the system could be in place by the middle of next year. Preliminary discussions have also taken place with CrossRef, a publishers' group, about whether journals could work together to implement a similar scheme.

Giles, J. Preprint analysis quantifies scientific plagiarism. Nature 444, 524–525 (2006). https://doi.org/10.1038/444524b

Issue Date: 30 November 2006
DOI: https://doi.org/10.1038/444524b
