Whether from the petabytes of data produced by the Large Hadron Collider, or the hundreds of millions of bases in the human genome, much scientific analysis nowadays relies on computers to pull out meaning from swathes of data. But one vast store of information, the research literature, has so far seemed immune to computer analysis. By and large, articles exist only in formats designed for humans to read — such as this paragraph.
Text-mining aims to break down this barrier. Using natural-language-processing concepts honed over the past 30 years, computer programs are starting to pull out information from plain text, including patents and research articles. Right now, the software requires highly skilled operators, but in the next decade it might transform the way scientists read the literature. Text-miners hope to make scientific discoveries by scouring hundreds of research papers for associations and connections (such as between drugs and side effects, or genes and disease pathways) that humans reading each paper individually might not notice.
The promise has yet to be backed up with concrete examples of scientific success, although in the pharmaceutical industry, text-mining companies are already working with researchers to speed up drug discovery. But academics are struggling even to run experiments, because publishing licences do not let them text-mine research papers and publishers are slow to respond to text-mining requests. Fed up after two years of negotiations, one team of researchers is launching a public website to log publishers' responses (see page 134).
There is no doubt that a completely open research literature would make it easier to demonstrate how such machine-reading can lead to scientific discovery. But the question is how to make progress today, when much research lies behind subscription paywalls and even 'open' content does not always come with a text-mining licence: 83% of the 'free' research in the PubMed Central online archive lacks one.
Publishers should agree that scientists who have already paid for access to research papers may text-mine content at no extra cost and publish their findings, as long as doing so does not breach the original paywall. Publishers can have no claim on the data in articles, only on the way in which the articles have been edited and formatted. They should make their text-mining policies clear and consider following the example of the journal Heredity, which says it is “seeking to encourage text-mining experiments”. (Its publisher, Nature Publishing Group, which also publishes this journal, says that it does not charge subscribers to mine content, subject to contract.)
On the other hand, text-miners need to make a better case for their technology. They say they are in a catch-22 situation: how can they demonstrate the benefits if they aren't allowed to run experiments on the literature? Instead, they text-mine abstracts, usually by picking out key words, which is a pale shadow of what full-text mining might offer. Casey Bergman at the University of Manchester, UK, is chronicling projects that have tried to text-mine the available PubMed Central content (see go.nature.com/2pqp8g) and finds very few examples, which suggests that text-miners are reluctant even to mine the corpus of free content.
Publishers point out that they receive few text-mining requests, and conclude that the field can't be very hot. Unless text-miners start to make full use of the content that is available, and request more access to published content, while always being clear about how their project will benefit science, the unsatisfactory impasse will continue.