Nature | News

Legal confusion threatens to slow data science

Researcher who spent months chasing permission to republish online data sets urges others to read up on the law.


Daniel Himmelstein, pictured at his previous research post at the University of California, San Francisco. Credit: Steve Babuljak

Knowledge from millions of biological studies encoded into one network — that is Daniel Himmelstein’s alluring description of Hetionet, a free online resource that melds data from 28 public sources on links between drugs, genes and diseases. But for a product built on public information, obtaining legal permissions has been surprisingly tough.

When Himmelstein, a data scientist at the University of Pennsylvania in Philadelphia, contacted researchers for permission to reproduce their work openly, several said they were surprised that he had to ask. “It never really crossed my mind that licensing is an issue here,” says Jörg Menche, a bioinformatician at the Research Center for Molecular Medicine of the Austrian Academy of Sciences in Vienna.

Menche rapidly gave consent — but not everyone was so helpful. One research group never replied to Himmelstein, and three replied without clearing up the legal confusion. Ultimately, Himmelstein published the final version of Hetionet in July — minus one data set whose licence forbids redistribution, but including the three that he still lacks clear permission to republish. The tangle shows that many researchers don’t understand that simply posting a data set publicly doesn’t mean others can legally republish it, says Himmelstein.

The confusion has the power to slow down science, he says, because researchers will be discouraged from combining data sets into more useful resources. It will also become increasingly problematic as scientists publish more information online. “Science is becoming more and more dependent on reusing data,” Himmelstein says.

Data-set laws

Because a piece of data — a fact — cannot be copyrighted, many scientists think that a publicly posted data set that does not place explicit terms and conditions on access can simply be republished without legal problems. But that’s not necessarily correct, says Estelle Derclaye, a specialist in intellectual-property law at the University of Nottingham, UK.

The European Union assigns specific database rights, independent of copyright, that aim to protect the investment made in compiling a database. Legally speaking, these rights prevent researchers such as Himmelstein from republishing data sets created by scientists in EU states without their consent.

A visualization of Hetionet, a network that links data on genes (centre) to diseases, molecules and biological processes. Credit: Sergio Baranzini & Daniel Himmelstein

Other countries have different layers of legal protection. But even in jurisdictions such as the United States, where no separate rights exist to govern databases, there is still room for confusion. Although facts don’t qualify for copyright, the way they are compiled arguably might — if the act of making that compilation requires sufficiently creative expression. “The default legal position on how data may be used in any given context is hard to untangle,” according to a guide on licensing data issued by the Digital Curation Centre in Edinburgh, UK.

Advocates of data-sharing accordingly recommend that researchers who are creating public databases add clear licences explaining how they intend their data to be reused and redistributed, and whether they waive any database rights.
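One way to follow that advice, sketched here as an illustration rather than a requirement, is to ship a machine-readable licence with the data itself. The Frictionless Data “data package” convention, for instance, lets a creator declare the licence in the `datapackage.json` file that describes the data set; CC0 is a common choice because it is a public-domain waiver that also expressly waives EU database rights. The dataset name and file below are hypothetical:

```json
{
  "name": "example-dataset",
  "title": "Example gene-disease associations (hypothetical)",
  "licenses": [
    {
      "name": "CC0-1.0",
      "path": "https://creativecommons.org/publicdomain/zero/1.0/",
      "title": "Creative Commons Zero v1.0 Universal"
    }
  ],
  "resources": [
    { "name": "associations", "path": "associations.csv" }
  ]
}
```

A plain-text LICENSE file or a note in a repository README serves the same purpose; what matters is that reusers can find an explicit statement instead of having to e-mail the authors.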

Lack of confidence

In Himmelstein’s case, some of the data sets that he wanted to use had clear licences — and some of these prevented unrestricted redistribution, but others did not. The most frustrating part of his project, he says, was the feeling that good data were going to waste because their creators could not clarify whether he could republish them.

Andrew Charlesworth, an intellectual-property expert at the University of Bristol, UK, says that this may be because few researchers were confident enough of the law to give Himmelstein clear guidance. “What you tend to find is that if nobody has a remit to answer those kinds of questions, they are not in a hurry to take it on,” he says.

Even without clear permissions, Himmelstein is unlikely to face legal penalties for publishing Hetionet, says Jonathan Band, an intellectual-property lawyer with the law firm Policy Bandwidth in Washington DC — unless, that is, he mistakenly breached terms and conditions placed on the data sets. Academics who put their data sets publicly online usually intend their work to be available for others to republish freely; and no one has ever got into trouble for doing Himmelstein’s kind of project, Band adds.

But Himmelstein is not convinced that he is legally in the clear — and feels that such uncertainty may deter other scientists from reproducing academic data. If a researcher launches a commercial product that is based on public data sets, he adds, the stakes of not having clear licensing are likely to rise. “I think these are largely untested waters, and most academics aren’t in the position to risk setting off a legal battle that will help clarify these issues,” he says.

Journal name: Nature
Volume: 536
Pages: 16–17
DOI: 10.1038/536016a

Comments

  1. George McNamara
    Daniel has nothing to fear — as long as he and his data stay in the United States. That data cannot be copyrighted but can be legally encumbered in other countries is pretty silly. The author of the story, Simon Oxenham, and the editors could have replaced the photo of Daniel H. with a map of the world showing where data are free and clear (the U.S. and Canada) versus enclosed in some nebulous black box of database legalese (the E.U.), and made some effort to uncover how “data are facts, facts cannot be copyrighted” might nevertheless be handcuffed in other parts of the world.

    I wrote in 2006 (http://onlinelibrary.wiley.com/doi/10.1002/cyto.a.20304/full): During the course of developing this data, one of us had an epiphany while reading in Lessig (18) about a U.S. Supreme Court decision: data is not subject to copyright (14). Text and commentary about Feist can be found on many legal web sites by doing a Google search. Indeed, the broad availability of the text of Supreme Court decisions is because they are not subject to copyright. The Feist decision reaffirmed the U.S. Copyright Act of 1976 that “there can be no copyright in facts”. The basis for the Feist decision can be found in the U.S. Constitution.

    Several hundred of the PubSpectra traces were digitized by un-scanning. Some of this was because digitizing one or a few traces from well-organized published figures is easy. Other data were digitized because the corresponding author never responded, was unable to find the data, or declined a request for data (contrary to the public-sharing requirements for NIH- and NSF-funded projects; see 19). Several companies and researchers have been generous in making spectral data available, while others in the microscopy, flow cytometry and spectroscopy communities have not. This contrasts with the floods of data made available by the genome and microarray communities. Ironically, both of the latter communities have been successful because of the same light sources, optics, fluorescent dyes and detectors used in fluorescence microscopy.

    We encourage all researchers to publish numerical spectral data as supplementary data to their online journal article and/or post the data on their own web site. We also encourage editors and reviewers to ask that all data be included as supplemental material at the time of manuscript review. The principle that data is not subject to copyright provides a framework in which all scientific data should be made freely accessible.

    Feist Publications, Inc. v. Rural Tel. Serv. Co., 499 U.S. 340 (1991). Lessig, L. The Future of Ideas. New York: Random House; 2001, p. 368. The U.S. Constitution is online at http://www.archives.gov/exhibits/charters/constitution.html, with an easier-to-read transcript at http://www.archives.gov/exhibits/charters/constitution_transcript.html

    The PubSpectra dataset is a Microsoft Excel file inside a zip file downloadable at https://works.bepress.com/gmcnamara/9. The spectra are organized one per column; one of the rows gives the source attribution, either a scientific paper or a web source.

    I call on Nature Publishing Group to require that all data accompanying a manuscript be included as supplemental file(s) free and clear of EU database legalese. For data too large or unwieldy to host at NPG, the data should be deposited in open-data-access site(s) hosted in the U.S., Canada or another location not subject to database legalese. I also call on NPG to require that all graph traces and tables published in the main text or supplemental files include the numeric data as well.
  2. Heather Morrison
    Himmelstein raises some important points about the complexities of data sharing, most notably that “these are largely untested waters”. Some will point to open licensing as a solution; however, this may not be as simple a solution as one might think. A few examples of issues that need to be addressed before broad-based data-licensing policies are adopted:

    1. While in the EU there are database protection rights and it may be necessary and/or helpful for data sharers to waive these rights, in other jurisdictions (e.g. Canada, the U.S.) this doesn't make sense. Scassa argued in 2013 that Canada's copying of UK language with respect to waiving database rights didn't make sense here. I would go a step further and ask whether other jurisdictions adopting database-rights waiver language could become a precedent for the future establishment of database rights: could someone go to court in Canada or the U.S. in future and say, look, I didn't waive these rights, and thereby assert them by precedent? (Future comments by people with legal expertise would be helpful; I am not a lawyer.)

    2. Researchers often work with data that clearly does not belong to them, even without database rights. Consider health scientists who work with data provided by hospitals, or social scientists who work with data provided by schools, non-governmental organizations or businesses. Even in the area I work in, open access, which one might think would be relatively unproblematic, the data that I use is not mine. I use data like the DOAJ downloadable metadata and article-processing-charge information from publishers' websites. If I use the NPG APC information, does it become mine to license?

    3. Attribution raises all kinds of problems. I am happy to see people use my data downstream, but as a new faculty member I need attribution and citations. If someone uses my whole dataset and downstream attributions are to them, not to me, that is a major problem for me. One issue I see arising: how much of an original contribution should trigger a shift in primary attribution? If someone does a whole lot of work and uses a few of my datapoints, obviously they should be cited first. On the other hand, if someone uses all of my dataset and adds just a little of their own work, that's a different story.

    4. The related question of provenance is even more important with respect to conducting quality downstream research. By provenance I mean correctly understanding where a particular data point came from and how it was derived. This is just one aspect of the emerging importance of research data management. As another example from my research, even what one might assume is a simple data point, the "article processing charge", is defined in slightly different ways by different researchers.

    5. While mashing up datasets appears on the surface to provide tremendous potential for generating hypotheses, the utility of this approach in hypothesis testing is far from clear. The external validity of results obtained through data mash-ups will be limited by such factors as research conducted using even slightly different definitions. This is important because, if it is not understood, the internal validity of freely available mash-up datasets (i.e. anyone can obtain the same results) may cause researchers and/or readers to overestimate the external validity of results.

    In conclusion, I believe that opening up data for downstream research offers a great deal of promise to further our knowledge, and I actively practice open data sharing myself. However, I think at this time a great deal more research and thought is needed on the questions of data ownership (hence licensing), attribution, provenance and validity, to maximize the potential benefits and avoid creating new downstream problems and, in the worst-case scenario, apparently new knowledge which is actually false.

    Reference: Scassa, T. (2013). Canada's new draft open government licence. http://www.teresascassa.ca/index.php?option=com_k2&view=item&id=113:canadas-new-draft-open-government-licence&Itemid=83