Legal confusion threatens to slow data science

Researcher who spent months chasing permission to republish online data sets urges others to read up on the law.

Daniel Himmelstein, pictured at his previous research post at the University of California, San Francisco. Credit: Steve Babuljak

Knowledge from millions of biological studies encoded into one network — that is Daniel Himmelstein’s alluring description of Hetionet, a free online resource that melds data from 28 public sources on links between drugs, genes and diseases. But for a product built on public information, obtaining legal permissions has been surprisingly tough.

When Himmelstein, a data scientist at the University of Pennsylvania in Philadelphia, contacted researchers for permission to reproduce their work openly, several said they were surprised that he had to ask. “It never really crossed my mind that licensing is an issue here,” says Jörg Menche, a bioinformatician at the Research Center for Molecular Medicine of the Austrian Academy of Sciences in Vienna.

Menche rapidly gave consent — but not everyone was so helpful. One research group never replied to Himmelstein, and three replied without clearing up the legal confusion. Ultimately, Himmelstein published the final version of Hetionet in July — minus one data set whose licence forbids redistribution, but including the three that he still lacks clear permission to republish. The tangle shows that many researchers don’t understand that simply posting a data set publicly doesn’t mean others can legally republish it, says Himmelstein.

The confusion has the power to slow down science, he says, because researchers will be discouraged from combining data sets into more useful resources. It will also become increasingly problematic as scientists publish more information online. “Science is becoming more and more dependent on reusing data,” Himmelstein says.

Data-set laws

Because a piece of data — a fact — cannot be copyrighted, many scientists think that a publicly posted data set that does not place explicit terms and conditions on access can simply be republished without legal problems. But that’s not necessarily correct, says Estelle Derclaye, a specialist in intellectual-property law at the University of Nottingham, UK.

The European Union assigns specific database rights, independent of copyright, that aim to protect the investment made in compiling a database. Legally speaking, these rights prevent researchers such as Himmelstein from republishing data sets created by scientists in EU states without their consent.

A visualization of Hetionet, a network that links data on genes (centre) to diseases, molecules and biological processes. Credit: Sergio Baranzini & Daniel Himmelstein

Other countries have different layers of legal protection. But even in jurisdictions such as the United States, where no separate rights exist to govern databases, there is still room for confusion. Although facts don’t qualify for copyright, the way they are compiled arguably might, if the act of making that compilation requires sufficiently creative expression. “The default legal position on how data may be used in any given context is hard to untangle,” according to a guide on licensing data issued by the Digital Curation Centre in Edinburgh, UK.

Advocates of data-sharing accordingly recommend that researchers who are creating public databases add clear licences explaining how they intend their data to be reused and redistributed, and whether they waive any database rights.

Lack of confidence

In Himmelstein’s case, some of the data sets that he wanted to use had clear licences; some of these prevented unrestricted redistribution, but others did not. The most frustrating part of his project, he says, was the feeling that good data were going to waste because their creators could not clarify whether he could republish them.

Andrew Charlesworth, an intellectual-property expert at the University of Bristol, UK, says that this may be because few researchers were confident enough of the law to give Himmelstein clear guidance. “What you tend to find is that if nobody has a remit to answer those kinds of questions, they are not in a hurry to take it on,” he says.

Even without clear permissions, Himmelstein is unlikely to face legal penalties for publishing Hetionet, says Jonathan Band, an intellectual-property lawyer with the law firm Policy Bandwidth in Washington DC — unless, that is, he mistakenly breached terms and conditions placed on the data sets. Academics who put their data sets publicly online usually intend their work to be available for others to republish freely; and no one has ever got into trouble for doing Himmelstein’s kind of project, Band adds.

But Himmelstein is not convinced that he is legally in the clear, and feels that such uncertainty may deter other scientists from reproducing academic data. If a researcher launches a commercial product that is based on public data sets, he adds, the stakes of not having clear licensing are likely to rise. “I think these are largely untested waters, and most academics aren’t in the position to risk setting off a legal battle that will help clarify these issues,” he says.

Cite this article

Oxenham, S. Legal confusion threatens to slow data science. Nature 536, 16–17 (2016) doi:10.1038/536016a
