Lizzie Wolkovich always felt she ought to make her research data freely available online. “The idea that data should be public has been in the background through my entire career,” she says.
Yet in 2003–09, while she was working on her ecology PhD, there were few incentives for her to share. Sharing would not help her to get grants or publications, and although posting data online was not unheard of, few researchers actually did it, she says. Many preferred to hang on to their hard-won field data, sharing privately if they did so at all.
But after she earned her doctorate, Wolkovich overcame her hesitation, thanks to a combination of helpful colleagues, improved resources and a discernible shift in the research community's attitude. So in 2010, through an online data repository called the Knowledge Network for Biocomplexity, Wolkovich released her doctoral data set — the fruit of thousands of hours spent measuring the diversity of arthropods in 56 experimental soil plots she had set up in the arid scrubscape of southern California. Since then, she has publicized all the data that she has collected, including a meta-analysis of 50 other studies that she examined to see how factors such as rising temperatures affect the life cycles of plants. Wolkovich, now at the University of British Columbia in Vancouver, Canada, says that she herself had never objected to sharing her results — she had just not known how to do so. She likes the fact that her data are now easily accessible to other researchers and anyone else who is interested. “It saves me so much time,” she says.
Wolkovich is one of a number of early-career researchers who are enthusiastically posting their work online. They are publishing what one online-repository founder calls small data — experimental results, data sets, papers, posters and other material from individual research groups — as opposed to the 'big data' spawned by large consortia, which usually employ specialists to plan their data storage and release. The many resources now available give researchers options for where and how to post their data, releasing potentially fruitful data sets that used to be locked up in unpublished paper files, buried in journal-article appendices or hidden away on scientists' hard drives.
Open data-sharers are still in the minority in many fields. Although many researchers broadly agree that public access to raw data would accelerate science — because other scientists might be able to make advances not foreseen by the data's producers — most are reluctant to post the results of their own labours online (see Nature 461, 160–163; 2009). When Wolkovich, for instance, went hunting for the data from the 50 studies in her meta-analysis, only 8 data sets were available online, and many of the researchers whom she e-mailed refused to share their work. Forced to extract data from tables or figures in publications, Wolkovich's team could conduct only limited analyses.
Some communities have agreed to share online — geneticists, for example, post DNA sequences at the GenBank repository, and astronomers are accustomed to accessing images of galaxies and stars from, say, the Sloan Digital Sky Survey, a telescope that has observed some 500 million objects — but these remain the exception, not the rule. Historically, scientists have objected to sharing for many reasons: it is a lot of work; until recently, good databases did not exist; grant funders were not pushing for sharing; it has been difficult to agree on standards for formatting data and the contextual information called metadata; and there is no agreed way to assign credit for data.
But the barriers are disappearing, in part because journals and funding agencies worldwide are encouraging scientists to make their data public. Last year, the Royal Society in London said in its report Science as an Open Enterprise that scientists need to “shift away from a research culture where data is viewed as a private preserve”. Funding agencies note that data paid for with public money should be public information, and the scientific community is recognizing that data can now be shared digitally in ways that were not possible before. To match the growing demand, services are springing up to make it easier to publish research products online and enable other researchers to discover and cite them. There are so many, in fact, that choosing where and how to publish data sets and other supplementary material can be confusing (see 'Abundant options').
Box 1: Abundant options
Online data repositories are proliferating: the searchable catalogue Databib lists 594 websites. Hundreds are specialist sites, devoted to particular kinds of data. But general-purpose repositories do exist: they include Dryad, which many scientists use to store the data underlying their publications; GitHub, which is usually used to host software code and to collaborate on developing it, but also hosts other data; the European Commission repository ZENODO; and figshare.com, a general repository for posters, papers and data sets that welcomes negative results that would otherwise never be published. Publishers have started to launch journals dedicated to data sets and descriptions of data, such as BioMed Central's GigaScience. Some scientists post data on social networks such as ResearchGate or Academia.edu.
Each discipline is evolving its own ways to structure data and metadata. In biology alone, biosharing.org lists some 530 standards, including MIAME (Minimum Information About a Microarray Experiment) and PDB (Protein Data Bank format). To avoid confusion, researchers should familiarize themselves with the best practices in their fields. R.V.N.
“Lots of people are getting into data-hosting, and I think it will be tricky to decide where to put your data,” says Heather Piwowar, who studies data-sharing for the US National Evolutionary Synthesis Center in Durham, North Carolina.
Share and share alike
Although exhortations to share data often concentrate on the moral advantages of sharing, the practice is not purely altruistic. Researchers who share get plenty of personal benefits, including more connections with colleagues, improved visibility and increased citations. The most successful sharers — those whose data are downloaded and cited the most often — get noticed, and their work gets used. For example, one of the most popular data sets on multidisciplinary repository Dryad is about wood density around the world; it has been downloaded 5,700 times. Co-author Amy Zanne, a biologist at George Washington University in Washington DC, thinks that users probably range from climate-change researchers wanting to estimate how much carbon is stored in biomass, to foresters looking for information on different grades of timber. “I would much prefer to have my data used by the maximum number of people to ask their own questions,” she says. “It's important to allow readers and reviewers to see exactly how you arrive at your results. Publishing data and code allows your science to be reproducible.”
Even people whose data are less popular can benefit, adds Piwowar. By making the effort to organize and label files so that others can understand them, scientists become more organized and better disciplined themselves, and can avoid confusion later on. “It is often very hard to find and understand your own work if you are looking at it years from now,” says Piwowar. Scientists might be inclined to stuff their data into folders that can get lost and muddled — but if they store the files in an online repository, they are forced to curate and collate the data, she says.
The fear of being scooped is a powerful inhibitor. But scientists can put an embargo on their data, so that only they can see the work until they are ready to make it public. And data sets are becoming increasingly citable, bringing their authors formal recognition: data published in a data journal, on Dryad or on the repository figshare.com are given a digital object identifier (DOI) that can be referenced in other publications. (Figshare is owned by Digital Science, a sister company to Nature Publishing Group.)
Would-be sharers often worry that their data are too disordered or shoddy to release into the world. “I make my data available, and it can be a pain. I'm also scared and embarrassed about errors — most of us are, especially early-career scientists,” says Piwowar. “We don't yet have a culture of forgiveness around that, unlike in computer programming, where everyone knows there are bugs in code.” She advises researchers to look into repositories to get a sense of the quality standard for experimental data. “It doesn't have to be perfect,” she says. “It's probably less thorough than you think.”
As sharing grows more common, scientists may worry less about posting data sets. “Ultimately, data will be so ubiquitous that we will no longer be in a world where researchers are so scared,” says Carl Boettiger, an ecologist at the University of California, Santa Cruz, who keeps his entire laboratory notebook open online (see Nature 493, 711; 2013). “At the end of the day, science is a social process. You will never get there hiding yourself and your work,” he adds.
The right place
Depositing data on a personal website is unlikely to be the best way to get it reused and cited. For a start, the website may not be around in five years, says William Michener, director of e-science initiatives at the University of New Mexico in Albuquerque. Michener is principal investigator for a multinational programme called DataONE, which is funded by the US National Science Foundation and promotes best practices to scientists as part of its aim to make data more discoverable. Journal publishers back up their research papers with the help of non-profit archiving services such as Portico and CLOCKSS, which are financed by participating libraries and publishers, and which store material on a number of servers so that it will not disappear if a publisher goes bankrupt. Some data publishers have similar contingency plans, and Piwowar recommends looking into them. If no back-up plans are in place, she says, “it suggests they haven't prioritized well enough how to steward their data”.
Just as important as sharing data publicly is making sure that other researchers can understand them. Susanna-Assunta Sansone, associate director of the Oxford e-Research Centre at the University of Oxford, UK, says that putting out data without noting what they mean will ensure that “it's not really reusable”. To avoid this, researchers must choose appropriate metadata: descriptions of the data's content and how they are arranged and set up. This type of curation is useful not just for human readers, but also for computer programs that might be used to search through or connect data sets. Intelligent searches often rely on whatever descriptive metadata researchers have attached to the data. The metadata are read through an application programming interface (API), a set of commands that computer programs use to interact with data stores and pull information from them. Not all data repositories use APIs; those that do not may not be the best places to store or release information, because it could be hard for anyone to find.
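To see why descriptive metadata matter for discovery, consider the following minimal Python sketch. It is not how any particular repository works: the `DatasetRecord` fields, the `search` function and the sample records are all hypothetical, chosen only to show that a keyword search can find a well-described data set and miss one that carries nothing but a file-style title.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """Hypothetical metadata record; the field names are illustrative, not a standard."""
    title: str
    creator: str
    year: int
    keywords: list[str] = field(default_factory=list)
    description: str = ""


def search(records, term):
    """Return the records whose title, keywords or description mention the term."""
    needle = term.lower()
    hits = []
    for rec in records:
        # Only text that the depositor supplied can be matched.
        haystack = " ".join([rec.title, rec.description, *rec.keywords]).lower()
        if needle in haystack:
            hits.append(rec)
    return hits


# Made-up example records: one described carefully, one not described at all.
records = [
    DatasetRecord(
        title="Arthropod diversity in experimental soil plots",
        creator="Example Author",
        year=2010,
        keywords=["arthropods", "soil", "ecology"],
        description="Counts per plot; one row per sampling visit.",
    ),
    DatasetRecord(title="data_final_v2", creator="Example Author", year=2010),
]

matches = search(records, "arthropod")
print([rec.title for rec in matches])
```

The second record could hold exactly the same measurements as the first, but because its only metadata is an uninformative title, no search term a colleague is likely to type would ever surface it.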
Sites that are dedicated to hosting particular types of data, such as DNA sequences, usually tell submitters what format is appropriate. They may require data to be entered using an online form or following specific instructions. By contrast, generalist sites — such as institutional repositories, data journals or ventures similar to figshare.com — may have looser requirements. This has the potential to result in a blizzard of different formats and descriptive tags, which could make discovering and reusing data more difficult, so researchers should pay close attention to the norms in their fields.
Decisions about metadata standards should be made early in a research project, says Michener. DataONE has provided a primer on best practices, as has DataUp, a tool run through the University of California Curation Center in Oakland that helps researchers to create data packages that are good enough to put online. Other aspects of data-sharing to consider early on include the information's sensitivity and whether some parts must be stripped out to avoid, for example, identifying human study participants or the locations of endangered species. Researchers also need to be clear about whether they will allow their data sets to be used for any purpose, or whether they would like to limit reuse to, for example, non-commercial applications. One widely understood way of documenting reuse rights is by giving the data one of several different Creative Commons licences.
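As a rough illustration of those last two points, the Python sketch below strips sensitive columns from a table before release and records a licence alongside it. Everything in it is an assumption made for the example: the column names, the sample rows and the `strip_sensitive` helper are hypothetical, and the licence string is simply a commonly used short identifier for a Creative Commons non-commercial licence, not a recommendation.

```python
import csv
import io

# Hypothetical column names; which fields count as sensitive depends on the study.
SENSITIVE_COLUMNS = {"participant_name", "latitude", "longitude"}


def strip_sensitive(csv_text, sensitive=SENSITIVE_COLUMNS):
    """Return a copy of the CSV text with the sensitive columns removed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    kept = [name for name in reader.fieldnames if name not in sensitive]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=kept, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        writer.writerow({name: row[name] for name in kept})
    return out.getvalue()


# Made-up field data that includes exact site coordinates.
raw = (
    "species,count,latitude,longitude\n"
    "Example species A,12,34.01,-117.20\n"
    "Example species B,3,34.02,-117.21\n"
)

public_csv = strip_sensitive(raw)

# A small, hypothetical package description stating the reuse terms.
package = {
    "license": "CC-BY-NC-4.0",  # assumed choice: attribution, non-commercial reuse
    "columns": public_csv.splitlines()[0].split(","),
}

print(public_csv)
print(package)
```

Deciding on the sensitive columns and the licence before data collection starts means that a step like this can be applied routinely, rather than being improvised just before release.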
Ultimately, says Michener, early-career researchers need to pay attention to new and developing ways to share data, and to the standardized formats that are emerging to make data easier to search and discover. Those who do not, he says, should rethink why they are doing research. “I think we are just now reconnecting with what science is all about — not just creating new knowledge, but also sharing the information and data that underpins those discoveries.”