My data are your data

Journal name: Nature Biotechnology
Volume: 30
Pages: 509–511
DOI: 10.1038/nbt.2243

Encouraging broader and more inclusive data sharing in today's world will involve concerted community efforts to overcome technical barriers and human foibles. Vivien Marx investigates.

Introduction

Relic of the past? Data are trying to jump off shelves. (istockphoto).

In January, over 50 researchers from 30 academic and commercial organizations agreed on a standard for describing data sets. The BioSharing initiative, comprising both researchers and publishers, launched the Investigation-Study-Assay (ISA) Commons, which promises to streamline data sharing among different databases1. Life scientists have thousands of databases, over 300 terminologies and more than 120 exchange formats at their disposal, says BioSharing co-founder Susanna-Assunta Sansone of the University of Oxford. In this era of collaborative big science, researchers only move forward by “walking together.”

Although increased data sharing is central to scientific progress and is attracting attention from many quarters2, standards are only some of the stars that must align to make it possible.

Share and share alike?

Oversharing is embarrassing in social media but sharing is always a virtue for scientists. Although many scientists embrace the idea of sharing data in research, few manage it in practice. The traditional sharing method—the research paper—has made the transition online through HTML and PDF formats, but is becoming outmoded and unwieldy as life science research generates increasingly varied types of big data sets to be associated with a claim. These data sets quickly pile up as high-throughput instruments spray a fire hose of data in giga- or terabyte-sized files. Data are also getting an ever longer tail: the provenance of claims, hypotheses and arguments, along with supporting metadata, methods, software code and tools, multimedia, workflows and models. These challenges are spurring initiatives, such as BioSharing and several other nonprofit and commercial projects, to make sharing easier and more palatable (Supplementary Table 1).

Sharing “isn't just a good idea, it's essential,” says John Quackenbush of the Dana-Farber Cancer Institute and Harvard School of Public Health in Boston. “The progress of science has been slowed by failure to share data and methods,” he says. A physicist turned computational biologist, Quackenbush builds analysis tools and platforms, for example, for the Lung Genomics Research Consortium, which let users crunch through high-dimensional data sets (Box 1). Besides the risk of duplicating research, “it’s also about discoveries and breakthroughs that would be entirely missed if data and methods aren’t shared,” says David De Roure of the University of Oxford, who strategizes on the future of research methods for the UK's e-Science program.

Box 1: Letting data speak

Fair share?

Now that high-throughput devices, such as next-generation sequencers, let laboratories generate mountains of data overnight, biomedical researchers have plenty to feed computers down the hall or in a cloud. Even scientists without technology access can use their colleagues' data, commonly shared through online repositories, such as the US National Institutes of Health (NIH) National Center for Biotechnology Information (NCBI).

In the mid-1990s, NCBI's GenBank, one of the first public repositories to hold DNA sequences, was disseminated on a CD that came in the 'snail' mail. Today, researchers download dozens of terabytes of information a day from NCBI's resources, including GenBank, says NCBI director David Lipman. He acknowledges that some scientists might not always quickly deposit all their genome or expression data. But one positive move does follow another in a “virtuous circle,” he says.

Researchers harvest data, analyses, software tools and insight from papers and online portals in a hunting-and-gathering activity, a “hodgepodge of methods” that functions well enough, says Steven Salzberg from the Johns Hopkins University School of Medicine. One problem, he cautions, is that the number of studies providing all the ingredients needed to reproduce a study's findings “is still relatively small.” Thus, the consistent and full sharing of data and methods still isn't the norm3, 4.

Scientists' motivation to share has always been mixed, according to Bob Campbell, editor at Wiley-Blackwell in Oxford and publishing historian. “Newton was famously secretive,” he says. To claim credit, researchers will switch into sharing mode. “Darwin delayed publication until he realized that Wallace was on to the same idea.”

Expanding the outlook

What data sharing in science looks like tomorrow may be utterly unlike today's practices. De Roure promotes abandoning the idea that “knowledge has to be exchanged in paper-sized chunks.” He points to the Force11 group, formed in 2011 by scientists, librarians, research funders and editors at several publishing houses, which strives for “semantically enhanced, media-rich digital publishing.”

The group's 'manifesto' published last fall recommends that we “rethink the unit and form of the scholarly publication,” to move beyond the electronic facsimiles of an ink and paper journal publication toward digitally enabled complex entities. A world of “networked knowledge objects” may seem a futuristic scenario, but efforts now underway to give data, methods and metadata their own web-enabled voice already tug science-sharing in that direction5.

Some researchers are pushing publishers for computational access to the scientific literature so machines can harvest and share findings more quickly than human readers. These ventures head into choppy seas. Since beginning their Genocoding Project in 2009, Maximilian Haeussler of the University of California at Santa Cruz and his colleague Casey Bergman, genomicist at the University of Manchester, UK, have received a frosty reception from publishers6, 7.

Haeussler's software tool scans papers for genomic identifiers and maps them to the human genome. In a log of responses to permission requests (http://text.soe.ucsc.edu/), the Genocoding site currently lists only 5 consents from publishers; 12 are thinking about it and 26 have issued outright denials.
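
As a rough illustration of the kind of scanning described above, the sketch below pulls identifier-like strings out of a passage of text with regular expressions. The two patterns (dbSNP-style "rs" numbers and RefSeq-style "NM_" accessions) and the function name find_identifiers are assumptions chosen for illustration. They are not taken from Haeussler's tool, and the step of mapping hits to genome coordinates is not shown.

```python
import re

# Hypothetical patterns, chosen for illustration only.
PATTERNS = {
    "dbsnp": re.compile(r"\brs\d{3,}\b"),
    "refseq_mrna": re.compile(r"\bNM_\d{6,9}(?:\.\d+)?\b"),
}


def find_identifiers(text):
    """Return a dict mapping pattern name to the sorted unique matches in text."""
    return {name: sorted(set(p.findall(text))) for name, p in PATTERNS.items()}


sample = "Variant rs12345 lies in NM_000546.5; rs12345 was replicated."
hits = find_identifiers(sample)
```

A real literature-mining pipeline would need many more identifier types and a lookup against genome annotation, which is where access to the full text of papers becomes the limiting factor.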

Tune into my network

Systems biology researchers already exchange models, sequences and software code about networks of interacting genes and proteins, but “we can't store working hypotheses,” says Trey Ideker at the University of California, San Diego, who developed Cytoscape software to analyze and visualize biological networks.

Researchers include network models in papers, many of which inevitably get stuffed into the supplementary online sections of publications. But only a “very small set of people” will discover that additional material and “an even smaller set” will write a software script to pull that model into their own, he says.

To explore new ways of exchanging models, Ideker is involved with Seattle-based Sage Bionetworks, a nonprofit that reaches out to the scientific community to encourage sharing. Founded in 2009 by former Merck researchers Stephen Friend and Eric Schadt, Sage is an open repository for different kinds of data and network models, says Ideker. Separately, Ideker's team is working on a Cytoscape-based database for sharing network models.

Sharing a special sauce

Through web portals, computational biologists share software tools and code in their papers, and they create web services, often installed on clouds, where users can run others' software. Cloud computing standardizes the hardware, remedying one challenge in software sharing, according to Ideker. Scientists using a cloud can no longer say, for example, “I could not get your software to run on my version of Unix,” he says.

Cloud or no cloud, Salzberg says that reproducing some types of in silico–based analyses remains a challenge. In a paper comparing leading genome-assembly programs, he and his group shared software and data. He included details about his software methods, “the 'special sauce' you need to run these assembly programs and reproduce our results,” which can be quite complex. The recipes are also on his laboratory's website and in a journal supplement8.

In an independent effort called the Assemblathon, a sort of bake-off of genome assembly programs, details were lacking on how the assemblers were run, “making it impossible to reproduce those computational experiments,” Salzberg says. This absence of pertinent information is the norm, in his view.

Sharing the “special sauce” of methods along with data and metadata is “still an enormous problem,” says Emory University computational biologist James Taylor, who co-developed the Galaxy platform of open-source sequence analysis tools, which can be downloaded or used on the cloud. And although publications lack fully detailed methods, policing that gap is too burdensome for reviewers, he says. Galaxy has an idea, called Pages, to stave off that chore for scientists.

Scientists are starting to use Galaxy Pages, a functionality for sharing data and workflow analysis steps. A scientist reading a published paper can visit the study's Galaxy Page to plug different data into the computational workflow and make high-throughput analysis interactive. “You can play with it, change the parameters, edit the workflows, they are now yours,” says Penn State University biologist Anton Nekrutenko, Galaxy's co-developer. Through Galaxy, laboratories can find methods that others successfully use, says Taylor. He and Nekrutenko agree that genome assembly is particularly challenging and echo Salzberg's call for sharing methods detail.

Not sharing workflows can lead research down blind alleys or, worse, on biomedical wild goose chases when an experiment yields rare variants, which might or might not be correct, he says. He points to a human resequencing project focused on heteroplasmy, which is when the mitochondrial genomes in one organism's cells or tissues show several variants. Nekrutenko and his Penn State colleagues showed that misalignment led to erroneous findings about heteroplasmies at certain spots in the genome from nine examined individuals in three families9. Galaxy wants to help researchers avoid such pitfalls. Sharing methods and tools can lead to community-based decisions on best practices, says Taylor.

When pipelines creak

To make analysis steps sharable, some laboratories want to embed clickable computational workflows into a paper. To do so, a journal article must be able to link to remote computing and data resources. Workflows have mixed stewardship in their orchestration of local and remote computational resources operated by different groups and companies, says University of Manchester computer scientist Carole Goble, whose team has built many web-based sharing platforms for the life sciences. Dashing through its paces, a workflow pipeline may break: software on a site crashes, a web service changes data format, funding lapses put a web service offline, she says. A sharable, reproducible workflow becomes unsharable and irreproducible.

With their platform Workflow 4Ever, Goble and colleagues across Europe want to lend longevity to methods sharing by preserving workflows. If a pipeline breaks, a record shows the previous iteration that ran smoothly. It avoids time spent on glitch hunting. Researchers know where to tweak the workflow.
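
The idea of keeping a record of the last iteration that ran smoothly can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: the RunRecord class, the step names and the version strings are all hypothetical, and nothing here reflects how Workflow 4Ever is actually implemented.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    run_id: int
    succeeded: bool
    steps: dict  # step name -> service version or data format used in that run


def last_good_run(history):
    """Return the most recent successful run, or None if there is none."""
    for record in reversed(history):
        if record.succeeded:
            return record
    return None


def changed_steps(good, broken):
    """List step names whose recorded settings differ between two runs."""
    names = set(good.steps) | set(broken.steps)
    return sorted(n for n in names if good.steps.get(n) != broken.steps.get(n))


# Hypothetical history: the second run failed after a service changed its format.
history = [
    RunRecord(1, True, {"align": "service v1", "annotate": "format A"}),
    RunRecord(2, False, {"align": "service v1", "annotate": "format B"}),
]
good = last_good_run(history)
suspects = changed_steps(good, history[-1])
```

Comparing the failed run with the last good one narrows the search to the steps that changed, which is the time saved on glitch hunting.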

The human factor

Better ways are needed to encourage sharing and recognize the effort of those producing data and methods and subsequently sharing them, says Quackenbush, lamenting that these efforts are still seen as the ugly stepchildren of modern science. Portals for data or methods sharing often suffer because they are built by “data geeks” who do not keep users in mind, he says.

Platform builders must “respect people's fears,” says Goble. When ready, researchers will share with trusted collaborators, but commands to share can fall on deaf ears. She prefers to credit people for sharing. Not only must these environments be “safe havens,” they should let scientists build “a reputation economy” around the objects being shared, so others see who shares how much, she says.

Quackenbush and his team designed the data coordination and analysis center for the Lung Genomics Research Consortium, in which teams across the United States characterize a biological sample collection for hints about lung disease. Although the LGRC research portal is open, it does not offer sequence-based data, which must go to the NIH database of Genotypes and Phenotypes, where controlled data access policies are in place and individuals are de-identified.

His group chose analytical methods and presentation formats that let users edit genomic content without having to be computer scientists. “It's not intelligence but rather empathy that drives a good user experience,” he says.

Users who are computational and statistical scientists prefer doing analysis right away. But they like filtered data chunks, so the team created a “shopping cart” to allow this filtering. Nonquantitative scientists looking for specifics about a gene or a pathway are the largest group of users. For them, the team made decisions about the major phenotypes and presented “reasonable summaries of the data, based on looking at what people wanted to know,” Quackenbush says. His team expanded a commercial tool, ClinicalSense, from the software and consulting firm ID Business Solutions headquartered in Guildford, UK, to organize clinical data, so users can define phenotypic groups on the fly if they do not like predefined types.

The sample and data tracking system makes transparent to group members who has samples and who produces data, he says. One of the “objective measures of success” is sharing data by depositing it in the portal. Report season in this collaborative effort shows who is meeting production targets, which the project scientists set themselves.

“We're all human, so as you might expect, data deposition spiked before calls and in-person meetings,” Quackenbush says. The system's transparency has a number of advantages. The team can quickly see a problem when array data do not match a patient's exome sequence profile. “So our 'due diligence' in measuring output had some very positive side effects,” he says.

Goble adds an ingredient to her platforms that she calls “sharing creep.” In the web-based platform myExperiment, users share workflows or keep them private. First, a scientist must be ready to deposit information, which is a step reluctantly taken, as it is the “expensive” and “undervalued” task of sharing and describing what is deposited. A follow-on chapter is persuading scientists to share with others. A project will garner few joiners if it begins by proclaiming, in the name of “open science fundamentalism,” that everything must be open, Goble says.

She and her team also expanded on a commercial tool: the spreadsheet. When developing a data-sharing platform called SysMO-SEEK for the European Systems Biology of Microorganisms program, she and her team made the spreadsheet readable by people and machines. The team equipped spreadsheets with objects that adhere to minimum information models for different 'omics classes. When an experimentalist enters data, the spreadsheet cells offer drop-down menus with terms, standardizing categories and making them easier to share. The tool, called RightField, does “semantic annotation by stealth,” Goble says.

Users need not bother with the fact that beneath the spreadsheet sits a controlled vocabulary that uses the same terms to describe the same things, an important element of sharing. The vocabulary is connected to BioPortal, a repository run by the US National Center for Biomedical Ontology, an NIH-funded consortium that is part of the National Centers for Biomedical Computing.
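
A controlled vocabulary behind a data-entry cell can be pictured as a lookup that accepts only known terms and quietly attaches an identifier to each. The sketch below assumes a tiny, made-up vocabulary with placeholder identifiers ("EX:0001" and so on) and hypothetical function names. It is not RightField's implementation and does not use real BioPortal terms.

```python
# Hypothetical vocabulary: display label -> placeholder term identifier.
VOCABULARY = {
    "glucose": "EX:0001",
    "lactate": "EX:0002",
}


def annotate_cell(value):
    """Return (label, term_id) if value is an allowed term, else raise ValueError."""
    label = value.strip().lower()
    if label not in VOCABULARY:
        raise ValueError(f"{value!r} is not in the controlled vocabulary")
    return label, VOCABULARY[label]


def annotate_column(values):
    """Annotate every cell, collecting rejected entries instead of stopping."""
    accepted, rejected = [], []
    for v in values:
        try:
            accepted.append(annotate_cell(v))
        except ValueError:
            rejected.append(v)
    return accepted, rejected


accepted, rejected = annotate_column(["Glucose", "lactate ", "sugar"])
```

A drop-down menu achieves the same effect at entry time: the experimentalist can only pick a term that already exists, so no entry needs to be rejected afterward.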

RightField reflects Goble's strategy of building “ramps” around tools already in use. That approach gets scientists to “just do that little bit more” to enrich research assets so they can be more readily shared. Rather than ask people to fill out a six-page sharing template, one needs “spoons full of sugar to make the medicine go down.”

Change history

Corrected online 07 December 2012
In the version of this article initially published, the statement on p. 509, column 2, top paragraph, was missing a few words, making it a fragment: “Besides the risk of duplicating research discoveries that would be missed if data and methods aren't shared,’” should have read, “Besides the risk of duplicating research, ‘it’s also about discoveries and breakthroughs that would be entirely missed if data and methods aren’t shared’….” On p. 511, column 2, paragraph 3, it was stated that the Lung Genomics Research Consortium (LGRC) shares data within the consortium but not with a wider community. In fact, that is not the case, and the sentence now reads, “Although the LGRC research portal is open, it does not offer sequence-based data, which must go to the NIH database of Genotypes and Phenotypes, where controlled data access policies are in place and individuals are de-identified.” Also on p. 511, in column 3, paragraph 2, the European Systems Biology of Microorganisms program was incorrectly identified as SysMo-SEEK. It is SysMo. SysMO-Seek is a data-sharing platform developed for the 13 consortia that make up SysMo. The sentence now reads, “When developing a data-sharing platform called SysMO-SEEK for the European Systems Biology of Microorganisms program…”. The error has been corrected in the PDF and HTML versions of this article.

References

  1. Sansone, S.-A. et al. Nat. Genet. 44, 121–126 (2012).
  2. National Science Board Task Force on Data Policies. Digital Research Data Sharing and Management (NSF, December 2011). http://www.nsf.gov/nsb/publications/2011/nsb1124.pdf
  3. Tenopir, C. et al. PLoS ONE 6, e21101 (2011).
  4. Carpenter, J., Tanner, S., Smith, N. & Goodman, M. Researchers of Tomorrow: a Three Year (BL/JISC) Study Tracking the Research Behaviour of 'Generation Y' Doctoral Students. Annual Report (British Library/Joint Information Systems Committee, May 2011).
  5. De Roure, D., Bechhofer, S., Goble, C. & Newman, D. Scientific social objects: the social objects and multidimensional network of the myExperiment website. Presented at 1st International Workshop on Social Object Networks (SocialObjects 2011), October 2011, Boston, MA, US.
  6. Van Noorden, R. Nature 483, 134–135 (2012).
  7. Anonymous. Gold in the text? Nature 483, 124 (2012).
  8. Salzberg, S.L. et al. Genome Res. 22, 557–567 (2012).
  9. Goto, H. et al. Genome Biol. 12, R59 (2011).
  10. Altman, R. et al. Genome Biol. 9 (suppl. 2), S7 (2008).
  11. Mons, B. et al. The value of data. Nat. Genet. 43, 281–283 (2011).

Supplementary information

Word documents

  1. Supplementary Text and Figures (20 KB)

    Supplementary Table