Carbon copies? Software programmes can highlight suspected duplications. Credit: Digital Vision

As many as 200,000 of the 17 million articles in the Medline database might be duplicates, either plagiarized or republished by the same author in different journals, according to a commentary published in Nature today [1].

Mounir Errami and Harold 'Skip' Garner at The University of Texas Southwestern Medical Center in Dallas used text-matching software to look for duplicate or highly similar abstracts in more than 62,000 randomly selected Medline abstracts published since 1995. They hit on 421 possible duplicates.
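To give a feel for what text matching between two abstracts can look like, here is a minimal sketch in Python. It is not the researchers' software: it assumes a simple word-shingle Jaccard score, and the function names, the shingle size of 3 and the 0.5 threshold are all hypothetical choices made for illustration.

```python
import re


def shingles(text, size=3):
    """Return the set of overlapping word n-grams ("shingles") in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < size:
        # Very short texts become a single shingle (or nothing if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard_similarity(text_a, text_b, size=3):
    """Shared shingles divided by all distinct shingles, from 0.0 to 1.0."""
    set_a, set_b = shingles(text_a, size), shingles(text_b, size)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def is_possible_duplicate(text_a, text_b, threshold=0.5):
    """Flag a pair for manual inspection; the threshold is an assumption."""
    return jaccard_similarity(text_a, text_b) >= threshold


if __name__ == "__main__":
    first = "The cat sat on the mat"
    second = "The cat sat on the rug"
    print(jaccard_similarity(first, second))
    print(is_possible_duplicate(first, second))
```

A score near 1.0 means the two texts share almost all of their three-word sequences. As in the study, a high score only flags a pair for a human to inspect.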

After manual inspection, they estimated that 0.04% of the 62,000 articles might be plagiarized, and that 1.35% might be duplicates by the same author. These percentages are lower than those calculated by similar previous studies. As yet, the researchers aren't sure why that is.

A thorough examination of apparent duplicates is always essential, they add, to verify whether papers are in fact plagiarized or resubmitted articles, or simply false positives. Some duplication may also be innocent. The ultimate decision has to be made by an authoritative body such as a journal's editors or an ethics board, they note.

Spot check

The researchers didn't have enough computer power to compare all 17 million Medline articles with one another. But they found a shortcut to turn up more examples of possible duplicates. They had noticed that in almost three-quarters of the first 421 duplicates they identified, the duplicate article also cropped up in Medline itself as the 'most related article', a feature the Medline database offers to help searchers find related work.
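A rough count shows why comparing everything with everything is out of reach and why a shortcut helps. The Python sketch below assumes that an exhaustive search means one comparison for every possible pair of abstracts, and that the shortcut means one comparison per abstract that has a listed related article. The round figures and all names are illustrative assumptions, not the researchers' own accounting.

```python
def all_pairs_comparisons(n):
    """Number of distinct pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Round figures taken from the article; treated here as exact for illustration.
TOTAL_ABSTRACTS = 17_000_000
ABSTRACTS_WITH_RELATED = 7_000_000

# Exhaustive search: every abstract against every other abstract.
exhaustive = all_pairs_comparisons(TOTAL_ABSTRACTS)

# Shortcut (assumed): each abstract against its single 'most related' abstract.
shortcut = ABSTRACTS_WITH_RELATED

if __name__ == "__main__":
    print(f"Exhaustive comparisons: {exhaustive:,}")
    print(f"Shortcut comparisons:   {shortcut:,}")
    print(f"Reduction factor:       about {exhaustive // shortcut:,}x")
```

Under these assumptions the exhaustive search needs on the order of 10^14 comparisons, while the shortcut needs only millions.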

So they looked more closely at more than 7 million Medline abstracts with listed related articles, running their algorithm against just the original abstract and its 'most related' abstract. This threw up around 70,000 potential duplicates. The team guesses from their experience so far that 50,000 of these will turn out to be true duplicates, with the rest being false positives. Extrapolating this result to all 17 million articles in Medline, and applying some correction factors, they come up with an estimate of some 200,000 duplicates.
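The arithmetic behind such an extrapolation can be sketched as follows. This Python example uses only the round figures quoted above and simple proportional scaling. The correction factors the researchers applied are not described here, so the sketch does not try to reproduce the 200,000 figure; the variable and function names are hypothetical.

```python
# Round figures from the article; the real counts are only approximate.
SCREENED_ABSTRACTS = 7_000_000    # abstracts with a listed related article
FLAGGED_PAIRS = 70_000            # potential duplicates thrown up
EXPECTED_TRUE = 50_000            # the team's guess at true duplicates
TOTAL_ABSTRACTS = 17_000_000      # all of Medline


def scale_up(true_duplicates, screened, total):
    """Proportional scaling: assume the same duplicate rate everywhere."""
    return true_duplicates * total / screened


# Share of flagged pairs expected to be genuine duplicates.
expected_true_share = EXPECTED_TRUE / FLAGGED_PAIRS

# Duplicate rate among the screened abstracts.
screened_rate = EXPECTED_TRUE / SCREENED_ABSTRACTS

# Uncorrected estimate for the whole database.
naive_estimate = scale_up(EXPECTED_TRUE, SCREENED_ABSTRACTS, TOTAL_ABSTRACTS)

if __name__ == "__main__":
    print(f"Expected true share of flags: {expected_true_share:.1%}")
    print(f"Rate among screened abstracts: {screened_rate:.2%}")
    print(f"Uncorrected estimate for Medline: {naive_estimate:,.0f}")
```

Simple scaling gives roughly 121,000 duplicates. The gap between that and the published estimate of some 200,000 reflects the unspecified correction factors, which would have to account for, among other things, duplicates that never show up as the 'most related article'.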

The researchers have put the 70,000 suspected duplicates on a publicly accessible database called Déjà vu. Users of this site are welcome to contact the researchers with information about the true nature of these suspected duplicates.

Public information

The team hopes that tools such as Déjà vu or text-comparison software will act as deterrents to would-be plagiarists. "The fear of having some transgression exposed in a public and embarrassing manner could be a very effective deterrent. Like Dickens's Ebenezer Scrooge, the spectre of being haunted by publications past may be enough to get unscrupulous scientists to change their ways," they write in the Nature commentary.

The scientists have also made their text-matching software, called eTBLAST, available to researchers and editors.

The idea of checking for duplication is catching on with journal publishers. Eight publishers included in CrossRef are taking part in a pilot test of an anti-plagiarism tool called CrossCheck, which uses text-matching algorithms by software company iParadigms. When the publishers receive a new manuscript, it is screened against CrossRef's database of existing articles, and papers that are suspiciously similar to existing ones are flagged up by the system for an editor to check. The pilot is expected to move to a working system in mid-2008, according to John Barrie, the CEO of iParadigms in Oakland, California.

More or less

There are few figures available for levels of plagiarism in scientific literature. A 2006 study of 280,000 preprints in the arXiv physics archive found much higher levels of suspected plagiarism (0.2%) and suspected duplicates by the same authors (10.5%) than the Medline analysis did. Another estimate comes from a 2002 survey of around 3,000 anonymous NIH authors: almost 5% said they had republished the same data or results, and just over 1% owned up to using another's ideas without permission or giving due credit (see 'Scientists behaving badly').

Garner and Errami speculate that their own results differ perhaps because arXiv hosts preprints that have not yet been refereed, whereas most Medline journal articles have been peer-reviewed. Additionally, the arXiv study compared full texts, whereas only abstracts were available from Medline for the current analysis.

The group plans to look further into the data to see what trends they can identify. Preliminary estimates show that a country's duplicate rates tend to be proportional to its total publications, but estimated duplication rates from China and Japan were roughly twice those expected.