Social Selection Nature’s snapshot of science on social media

A scientist who mines the text of research publications was blocked by the scientific publisher Elsevier from downloading large numbers of its papers, a move that he described in a blog post that was widely shared on social media. Chris Hartgerink, a statistician at Tilburg University in the Netherlands, says that the publisher is hindering his research. Elsevier allows text-mining through a specific application programming interface (API), and says that this prevents its website from being slowed down by researchers who download large amounts of data. Frank Huysmans, a library scientist at the University of Amsterdam, linked to the blog post on Twitter.

Hartgerink is studying the prevalence of data fabrication in the social sciences for his doctoral project. He uses a content-mining tool to download thousands of journal articles in bulk and search them for patterns in data and text that may be associated with fabrication. In his blog post, he wrote that he had downloaded psychology papers, amounting to about 30 gigabytes of data, from Elsevier’s ScienceDirect website over roughly 10 days. Soon after, his university librarian, who had been notified by Elsevier of Hartgerink’s activities, asked him to stop. “His case is exemplary of the difficulties researchers run into who are using copyrighted material as research data,” said Huysmans in an email to Nature.

Text and data miners have struggled for years to get permission from publishers to use programs that extract data — such as brain images or chemical information — from thousands of publications. Some publishers resisted, fearing redistribution of their articles, and they considered permission requests only on a case-by-case basis. “We spent a lot of time negotiating with publishers because some of them are knee-jerk resistant to letting scientists text-mine,” says David Haussler, scientific director of the Genomics Institute at the University of California, Santa Cruz (UCSC). Haussler and his team run the UCSC Genome Browser, which catalogues genetic information scraped from the literature. “Scientists love this tool, and it’s only possible by text mining,” he says.

To make text-mining easier for researchers, Elsevier introduced its API in 2014, which allows researchers to download batches of articles in a computer-readable XML format. Some publishers, including Springer Nature (which publishes Nature), work with an organization called the Copyright Clearance Center to give text-mining researchers access to their content in a similar format.

Elsevier’s API was designed to prevent researchers from mass-downloading from its website, which could make the site slower for everyone, says Alicia Wise, Elsevier’s director of universal access. “We don’t want to slow down the response times,” says Wise. “Some of these [content] mining programs can be going after thousands of articles. We have millions of human users so we need to make the experience good for everyone.”

Hartgerink says that the Elsevier API does not allow him to pull the data he needs from tables and figures. He also objects to the requirement from Elsevier that research based on mined data be used only non-commercially. Hartgerink says that this would limit where he can publish his work and who can use it. “I want to be able to publish where I want to — not because Elsevier puts a restriction on me.”

Wise says that it is possible for researchers to apply for a licence to use their work commercially. These applications are approved on a case-by-case basis. She adds that there are no plans to let users circumvent the API or to change the rules surrounding its use. Hartgerink’s response is to stop using Elsevier publications in his research; instead, he says that he will text-mine papers from other publishers, including PeerJ and BioMed Central, which are open-access.