Having more data than we can handle is no excuse to give up our efforts to promote data access, but it may make us think about new ways to make that access sustainable.
Data sharing may be a misnomer for what we are doing. In science, data are either traded in a market for information (explicitly or implicitly) or consigned to write-only databases. We should perhaps discuss incentives for an effective market in ideas rather than restrict our discussion to ensuring open data deposition.
One such approach is to construct an open data commons populated with rich precompetitive datasets chosen for fruitful re-use by a larger community. This vision of the Sage Bionetworks Commons and Platform is presented by Jonathan Derry and colleagues as a perspective in draft for your suggestions and critical evaluation (http://precedings.nature.com/documents/5883/version/1).
Data have often been inaccessible or not usefully presented because some data producers do not want their data to be used by everyone. There is a natural barter process among data holders that generates skeptical evaluation as well as discoveries. Whether this process among privileged labs is more productive than the scrutiny of many eyes on open data is not the question; we clearly need both. What has not been widely appreciated is that funders, with their data access mandates, risk short-circuiting the economy of knowledge production. Formatting and curating data also costs money, and those costs are not fully borne by funders because they have no metrics with which to judge curation efforts and do not want an endless commitment to resources that may not be used.
NCBI and EBI have been the trusted and accountable partners researchers have relied upon for sustainable data storage, without which we would be hard pressed to promote any data access model. Recently, however, NCBI has closed two of its repositories for raw nucleotide sequence data (http://www.ncbi.nlm.nih.gov/sra), mainly because of an explosion of next-generation sequencing runs that cannot be readily reduced to unambiguous calls. There are alternative places to store data. Until SRA returns, journals may have to trust the stability of the links to institutional databases that authors provide and handle the complaints of frustrated data users. To this end, it may help to have data producers publish citable data management plans explaining how to access and use the data. Cloud computing may eventually provide a solution to sequence data storage, provided there is a suitable business model. Providers will keep only the data from which they can make money and reputation.
There is hope that we can arrive at solutions because we share common interests in promoting data access. The principle that your reputation is made in the labs of others means that good citizenship is good business. You simply cannot publish enough papers on your data yourself to equal the productivity of the researchers you inspire. The interests of the journals you publish in and of the institutions and agencies that fund your work are likewise aligned: their need to demonstrate that they are contributing to the impact of the datasets you produce gives them every reason to do what they can to enable data sharing.
The sustainability of data access is often discussed by publishers of journals of record who have in part ensured the stable accessibility of the (albeit smaller) datasets of the past. Journals could step up and charge depositors what it really costs to make a large dataset accessible in perpetuity. If they do charge users for access, the price should be transparently related to distribution costs and the need to sustain the archive. Maybe sequence data will not accumulate exponentially forever. Simple discounting suggests that it will be cheaper to re-sequence genomes than to store existing reads. It may be that many large datasets are not really useful for research but are consigned to public databases as merely the burn-ins for technology that moves on. Still, unless we develop suitable metrics for data citation and promote their adoption, the experiment to evaluate the utility of data has not been done. Maybe the funders' need for data to be useful coupled with the incentive for publishers to make open access sustainable would provide the motive to do this properly.
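The discounting argument in the paragraph above can be made concrete with a small sketch. It compares the present value of storing a set of reads for a number of years with the present value of re-sequencing the same sample at the end of that period. The function names, the constant yearly rates of cost decline and the single discount rate are all assumptions made for illustration; none of them come from this editorial, and real storage and sequencing costs need not fall at a constant rate.

```python
def present_value_of_storage(annual_cost, years, cost_decline, discount_rate):
    """Discounted cost of keeping the reads for `years` years.

    annual_cost   -- storage cost in the first year (hypothetical figure)
    cost_decline  -- assumed fractional fall in storage cost per year
    discount_rate -- assumed yearly discount rate
    """
    total = 0.0
    for t in range(years):
        # Cost actually paid in year t, after t years of price decline.
        cost_in_year_t = annual_cost * (1.0 - cost_decline) ** t
        # Discount that payment back to today.
        total += cost_in_year_t / (1.0 + discount_rate) ** t
    return total


def present_value_of_resequencing(cost_today, years, cost_decline, discount_rate):
    """Discounted cost of re-sequencing once, `years` years from now.

    cost_today   -- what sequencing the sample costs today (hypothetical)
    cost_decline -- assumed fractional fall in sequencing cost per year
    """
    future_cost = cost_today * (1.0 - cost_decline) ** years
    return future_cost / (1.0 + discount_rate) ** years


def cheaper_to_resequence(storage_annual_cost, storage_decline,
                          sequencing_cost_today, sequencing_decline,
                          years, discount_rate):
    """True if one future re-sequencing run costs less than storing until then."""
    storage = present_value_of_storage(
        storage_annual_cost, years, storage_decline, discount_rate)
    resequencing = present_value_of_resequencing(
        sequencing_cost_today, years, sequencing_decline, discount_rate)
    return resequencing < storage
```

Calling `cheaper_to_resequence` with one's own estimates of the two cost trends shows how sensitive the conclusion is to them: the faster sequencing costs fall relative to storage costs, the sooner re-sequencing wins. The sketch ignores the cases where the original sample no longer exists or cannot be re-sequenced, in which case the stored reads have no substitute.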
Other incentives that can help with data access are to link author and contributor roles to data accessions and to link data accessions semantically into a concept web. Attribution licenses for articles and data are a good concept but lack enforcement. Attribution is also insufficiently granular at present, both at the data level and with respect to author roles. There are no agreed data citation metrics, and there are no examples of resource reallocation or career decisions to point to.
In discussions about ORCID (http://www.orcid.org/) and researcher disambiguation, it is essential that we discuss distributed as well as centralized ways for researchers to track and display their career achievements, connections and productivity. Popular sites like PubMed and Wikipedia provide places to start developing metrics, but it is important to offer researchers a choice of individual, institutional, funder, journal and consortium sites and to agree on what we are counting.
No second thoughts about data access. Nat Genet 43, 389 (2011). https://doi.org/10.1038/ng.827