Tensions grow as data-mining discussions fall apart

Scientists want to exempt computer-based text crawling from Europe’s copyright law.

Disagreement between scientists and publishers has deepened over a thorny issue: how to make it easier for computer programs to extract facts and data from online research papers. On 22 May, researchers, librarians and others pulled out of European Commission talks on how to encourage the techniques, known as text mining and data mining. The withdrawal has effectively ended the contentious discussions, although a formal abandonment can be decided only after a commission review in July.

Scientists have chafed for years at limitations on computer-aided research. They would like to use computer programs to crawl over thousands or millions of articles and other online research content, extracting data to build up databases or to pick out patterns such as associations between genes and diseases.
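As a rough illustration of what such a program does, the sketch below counts how often a gene name and a disease name appear in the same sentence across a handful of texts. It is a minimal example under stated assumptions, not a description of any tool used by the researchers quoted here: the vocabularies, the sample sentences and the function name `cooccurrences` are all invented for illustration, and real pipelines rely on curated ontologies and far more robust language processing.

```python
import re
from collections import Counter
from itertools import product

# Hypothetical vocabularies, chosen only for this illustration.
GENES = {"BRCA1", "TP53", "CFTR"}
DISEASES = {"breast cancer", "cystic fibrosis"}


def cooccurrences(documents):
    """Count how often each (gene, disease) pair shares a sentence."""
    counts = Counter()
    for text in documents:
        # Naive sentence split: break after '.', '!' or '?' plus whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            lowered = sentence.lower()
            # Gene symbols are matched case-sensitively as whole words.
            genes = {g for g in GENES
                     if re.search(rf"\b{re.escape(g)}\b", sentence)}
            # Disease names are matched as lowercase substrings.
            diseases = {d for d in DISEASES if d in lowered}
            counts.update(product(sorted(genes), sorted(diseases)))
    return counts


# Invented sample text standing in for mined article content.
SAMPLE = [
    "Mutations in BRCA1 raise the risk of breast cancer. TP53 is also studied.",
    "CFTR variants cause cystic fibrosis. BRCA1 was linked to breast cancer again.",
]

if __name__ == "__main__":
    for pair, n in cooccurrences(SAMPLE).most_common():
        print(pair, n)
```

Run over thousands or millions of articles rather than two invented snippets, counts of this kind are one simple way that candidate associations can be surfaced for further study.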

But in many parts of the world, including Europe, this sort of use currently requires permission from the content’s copyright owner. Even if an institution has paid to access a journal, its academics do not necessarily have permission to mine the text. Publishers, worried that their content might be redistributed for free, tend to block data-mining programs, giving extra licence permissions only on a slow, case-by-case basis (see Nature 483, 134–135; 2012). And although authors can now choose to publish under licences that explicitly allow text mining, that innovation doesn’t help text-miners wanting to run programs on decades of pre-existing content.

Rather than struggle through a thicket of different permissions set by publishers, some researchers want Europe to exempt text mining from copyright law — allowing them to run programs on content that they have paid for, and on free content, without fear of copyright breach. Last year, the UK government said that it plans to introduce exemptions for non-commercial purposes. Lenient ‘fair use’ rights in the United States may already allow text mining, depending on how the law is interpreted.

“There is an intense debate on this within the scientific and research community, with a large number of scientists pointing at the limits of the current copyright regulatory regime,” says Ryan Heath, a spokesman for European Commission vice-president Neelie Kroes. “This is a very serious issue, impacting on scientific excellence and innovation in Europe.”

To tackle the issue, last December the commission set up a working group — one of a number under a framework called Licences for Europe — to open discussions about new policies among publishers, researchers, librarians and other interested parties, such as technology companies. In late February, researchers complained in a letter to the commission that the group was constrained to discuss only text-mining licences, and not changes to copyright law (see Nature 495, 295; 2013) — a restriction that would “make computer-based research in many instances impossible”.

“Every researcher I’ve spoken to thinks licensing is a problem,” says Susan Reilly, projects manager at the Association of European Research Libraries in The Hague, the Netherlands. She coordinated the letter that declared the 22 May withdrawal from talks. “There was really no point in us continuing to attend,” she says. Other signatories include the non-profit Open Knowledge Foundation in Cambridge, UK, and the National Centre for Text Mining at the University of Manchester, UK.

“Continuing the group under current circumstances doesn’t make sense,” says Heath. “This is regrettable, but at least the process brought to the fore the major controversies in this area.” The European Commission, he adds, “will reflect on the implications and will address the matter at the time of the review of the Licences for Europe process in July”.

The European talks had always been conflicted because four different European Union administrative departments were involved — not only the department for research and innovation, but also those for education and culture, for media and information issues, and for Europe’s internal market, economy and intellectual-property rights. (The May letter argues that the research department is being squeezed out in favour of the others’ interests.)

“Since the Licences for Europe process has not managed to deliver in this area, other ways forward must be explored,” says Heath. An analysis under way by the commission’s internal-market department on the need for copyright reform may provide impetus for action, should it conclude that changes are needed.

Many publishers say that there are practical, as well as legal, barriers to text mining. Even if the practice were permitted through licences or changes to copyright law, researchers would still need a way to access websites without crippling publisher servers through excess traffic. And publishers want to be able to identify the purpose of the programs crawling their content, especially if mining is for commercial purposes, so as to decide “what they’re willing to allow at what cost”, says Sarah Faulder, chief executive of the Publishers Licensing Society in London, an industry body that took part in the talks.
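One common way for a crawler to avoid overloading a server is to enforce a minimum pause between successive requests. The sketch below shows that idea only; it is an assumption about how a considerate mining program might behave, not a mechanism described by the publishers or by the commission. The class name `PoliteThrottle` and its parameters are hypothetical.

```python
import time


class PoliteThrottle:
    """Enforce a minimum interval between successive requests to one host."""

    def __init__(self, min_interval_s, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s
        self._clock = clock   # injectable so the logic can be tested
        self._sleep = sleep   # without actually waiting
        self._last = None     # time of the previous permitted request

    def wait(self):
        """Block until min_interval_s has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval_s - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


if __name__ == "__main__":
    throttle = PoliteThrottle(min_interval_s=0.2)
    for i in range(3):
        throttle.wait()  # call before each download
        print("request", i, "allowed at", round(time.monotonic(), 2))
```

A crawler would call `wait()` before every download; the interval itself would in practice be whatever a publisher's terms or server capacity allow.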

To lower some of these practical barriers, the non-profit publisher collaboration CrossRef hopes to launch technology this year enabling text-mining researchers to agree to terms by clicking a button on a publisher’s website.

Discussions may have faltered, but scientists and librarians hope to keep talking to officials, says Reilly. “There’s lots of disagreement even among publishers,” she says. “Some are open to text and data mining, some are completely frightened of it. They need an informed discussion.”

