Publisher under fire for fake article webpages

‘Trap’ URLs can help publishers to catch automated downloading, but critics say that the approach is clumsy.

An online debate is swirling around a tactic that academic publisher John Wiley & Sons uses to fight online piracy. The company created a webpage, accessible by several URLs, that appeared to be an academic paper to automated downloading programs. But any users who accessed the URLs were then blocked from viewing other Wiley content. Wiley and other publishers use these ‘trap’ URLs — which are invisible to most human users — to detect and prevent unauthorized downloading and republishing of their content. But some users say that the tactic is too heavy-handed.

Computational biologist Richard Smith-Unna of the University of Cambridge, UK, brought Wiley’s use of trap URLs to light in late May, after several users were locked out of the university’s Wiley subscription. Smith-Unna, who says that he inadvertently accessed some trap URLs during a data-mining project, tweeted about the lock-out and posted a Google Doc containing the URLs. The post prompted an outcry from scientists and librarians on social media, and curious onlookers who clicked on the URLs also reported losing access to Wiley content.

Wiley says that it is no longer blocking people who used these URLs. “We apologize if this caused any inconvenience to users,” the publisher told Nature.

Locked out

Smith-Unna says that he stumbled across the URLs while mining the text and metadata of Wiley content using his university library’s subscription. He wanted to analyse how accurately data can be extracted from scientific papers that are stored in different file formats such as PDF, XML and HTML.

To find the articles, Smith-Unna wrote a program that used an article’s digital object identifier (DOI) to construct its URL and then download the article in PDF form. He fed the program a list of DOIs from his own database of scholarly metadata, which included data he’d obtained from sources such as Crossref — the group that provides DOIs — and online ‘data dumps’. Between 18 and 22 May, Smith-Unna downloaded 30,000 Wiley articles in three batches.
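The DOI-to-URL step described above can be sketched roughly as follows. This is an illustration only, not Smith-Unna’s program: the URL template, the host publisher.example and the function names are assumptions, because the actual URL pattern he used is not given here.

```python
from urllib.parse import quote

# Hypothetical URL template; the real pattern is not stated in the article.
PDF_URL_TEMPLATE = "https://publisher.example/doi/pdf/{doi}"


def doi_to_pdf_url(doi: str, template: str = PDF_URL_TEMPLATE) -> str:
    """Build a candidate PDF URL from a DOI string."""
    doi = doi.strip()
    # DOIs start with the "10." directory indicator and have a prefix/suffix.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI: {doi!r}")
    # Keep the slash between prefix and suffix; percent-encode other
    # reserved characters such as parentheses.
    return template.format(doi=quote(doi, safe="/"))


def build_download_list(dois):
    """Map each well-formed DOI to its candidate URL, skipping bad entries."""
    urls = {}
    for doi in dois:
        try:
            urls[doi] = doi_to_pdf_url(doi)
        except ValueError:
            continue  # malformed entry in the metadata, ignore it
    return urls
```

Nothing in a step like this can tell a genuine DOI from a trap identifier: any well-formed entry in the input list yields a URL that the program will then try to fetch.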

About a week later, a Cambridge University computer officer told Smith-Unna that Wiley had contacted the university library to warn that someone who had logged into the university’s virtual private network (VPN) might have had their account hacked. The library’s VPN logs showed that Smith-Unna had been logged in around that time.

A Wiley spokesperson says that the company revoked access, for a period of time, for a number of Cambridge users who clicked on the trap URLs.

On 29 May, Smith-Unna posted a Google Doc that included a list of more than 150 trap URLs hosted by Wiley. “Never suspecting anyone would be so arrogant as to pollute a scientific corpus with fake data, I attempted to use these DOIs for a legitimate academic mining project in good faith (one which I had pre-informed the library about),” he wrote in the Google Doc.

Don’t click

Librarians have criticized Wiley’s approach. Wayne State University’s Clayton Hayes wrote in a blog post on 2 June that publishers should combat unauthorized web crawling, but that using trap URLs that incorporate DOI-like identifiers compromises the integrity of academic publishing. “That is why we should find this kind of behavior concerning,” he wrote. “Because it demonstrates that supporting research is not the chief priority of these publishers.”

Eric Hellman, a library technologist with the Free Ebook Foundation, blogged on 22 June that trap URLs can be useful tools for publishers. But, he added in a tweet, Wiley’s were not particularly sophisticated.

Hellman says that there are better ways to stop automated downloads, such as slowing down the connection of a program that might be mass-downloading material.
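One way to picture the kind of slowdown Hellman describes is a server that counts each client’s recent requests and adds a growing delay once a threshold is passed, rather than cutting access off. The sketch below is a hypothetical illustration, not a method Hellman or any publisher has described in detail; the class name, parameters and default values are all assumptions.

```python
from collections import defaultdict, deque


class Throttle:
    """Slow down clients that request many pages in a short time window."""

    def __init__(self, window_s=60.0, free_requests=30,
                 delay_step_s=0.5, max_delay_s=30.0):
        self.window_s = window_s            # how far back requests are counted
        self.free_requests = free_requests  # requests allowed with no delay
        self.delay_step_s = delay_step_s    # extra delay per excess request
        self.max_delay_s = max_delay_s      # cap, so access is never cut off
        self._hits = defaultdict(deque)     # client id -> recent timestamps

    def delay_for(self, client_id, now):
        """Record a request at time `now` (seconds) and return the delay to apply."""
        hits = self._hits[client_id]
        # Forget requests that have fallen out of the counting window.
        while hits and now - hits[0] > self.window_s:
            hits.popleft()
        hits.append(now)
        excess = len(hits) - self.free_requests
        if excess <= 0:
            return 0.0
        return min(excess * self.delay_step_s, self.max_delay_s)
```

A server would call delay_for() with a clock reading such as time.monotonic() and wait that long before responding. A person browsing a few papers would never notice, whereas a bulk downloader would slow to a crawl without anyone being locked out.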

Geoffrey Bilder, the director of strategic initiatives at Crossref, says that trap URLs inevitably become visible to people — with Smith-Unna just one example. On 22 June, Carl Bergstrom, an evolutionary biologist at the University of Washington in Seattle, tweeted that after reading Hellman’s blog post, he had clicked on one of the false URLs Smith-Unna had uncovered and lost access to Wiley content. Bergstrom found himself on a page showing a largely blank article published in 2013 entitled “Constructive Metaphysics in Theories of Continental Drift,” written by “J.N. Smith”. Within seconds, Bergstrom had lost access to Wiley content, including his own open-access paper (although he regained it using a different browser).

“They’re blocking content that they shouldn’t be,” Bergstrom says. “It’s a very coarse solution.”

Although Wiley says that it has disabled these links, the company declined to comment on whether it will continue to use trap URLs. In a statement on the Liblicense listserve on 22 June, Tom Griffin of Wiley wrote that the URLs were meant to be visible only internally and to library security officers. “We do not know how they came to be known more widely.”

But on 27 June, Bergstrom again managed to block access to his own paper by clicking on the link in Hellman’s blog post. “At this point, any claims from Wiley that there is no IP-level blocking or that the problem has been resolved are extremely suspect to me,” he says.

User beware

Wiley isn’t the only academic publisher that uses trap URLs. In 2014, the American Chemical Society (ACS) blocked 200 users after they inadvertently clicked on a hidden URL called a spider-trap, according to a report in Chemical and Engineering News, an ACS publication. (The ACS declined a request for comment.)
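In outline, a spider trap of the kind reported here is simple: a set of URLs that no human reader should ever reach, and a rule that blocks any client that requests one. The following sketch is a hypothetical illustration and not any publisher’s actual system; the class name, paths and return values are assumptions.

```python
class TrapGate:
    """Block any client that requests one of a set of hidden trap paths."""

    def __init__(self, trap_paths):
        self.trap_paths = set(trap_paths)
        self.blocked = set()  # client ids that have hit a trap

    def handle(self, client_id, path):
        """Return "ok", "decoy" or "blocked" for a single request."""
        if client_id in self.blocked:
            return "blocked"
        if path in self.trap_paths:
            # Serve a decoy page and block the client from now on.
            self.blocked.add(client_id)
            return "decoy"
        return "ok"


gate = TrapGate(["/doi/10.9999/trap-0001"])  # made-up trap path
first = gate.handle("client-1", "/doi/10.9999/trap-0001")
second = gate.handle("client-1", "/doi/10.1002/real-paper")
```

In this sketch the first request returns the decoy and the second is refused, even though it asks for a genuine paper. If the client identifier were a shared network address rather than an individual account, a single hit would lock out everyone behind that address, which is one possible reading of the coarseness that critics describe.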

Ross Mounce, a biodiversity informaticist at the University of Cambridge, fell into a URL trap in 2014 while working at the Natural History Museum in London. As researchers develop better data-mining tools, he says, publishers might also develop better defences against them. “I’m worried we’re seeing the beginning of an evolutionary arms race between legacy publishers and researchers.”

A spokesperson for Nature’s publisher, Springer Nature, says that the company does not use trap URLs.

