Rapid release of prepublication data has served the field of genomics well. Attendees at a workshop in Toronto recommend extending the practice to other biological data sets.
Open discussion of ideas and full disclosure of supporting facts are the bedrock for scientific discourse and new developments. Traditionally, published papers combine the salient ideas and the supporting facts in a single discrete 'package'. With the advent of methods for large-scale and high-throughput data analyses, the generation and transmission of the underlying facts are often replaced by an electronic process that involves sending information to and from scientific databases. For such data-intensive projects, the standard requirement is that all relevant data must be made available at a publicly accessible website at the time of a paper's publication1.
One of the lessons from the Human Genome Project (HGP) was the recognition that making data broadly available prior to publication can be profoundly valuable to the scientific enterprise and lead to public benefits. This is particularly the case when there is a community of scientists that can productively use the data quickly — beyond what the data producers could do themselves in a similar time period, and sometimes for scientific purposes outside the original goals of the project.
The principles for rapid release of genome-sequence data from the HGP were first formulated at a meeting held in Bermuda in 1996; these were then implemented as policy by several funding agencies. In exchange for 'early release' of their data, the international sequencing centers retained the right to be the first to describe and analyze their complete datasets in peer-reviewed publications. The draft human genome sequence2 was the highest-profile dataset rapidly released before publication, with sequence assemblies greater than 1,000 base pairs usually released within 24 hours of generation. This experience ultimately demonstrated that the broad and early availability of sequence data greatly benefited life sciences research by leading to many new insights and discoveries2, including new information on 30 disease genes published prior to the draft sequence.
At a time when advances in DNA sequencing technologies mean that many more laboratories can produce massive datasets, and when an ever-growing number of fields (beyond genome sequencing) are grappling with their own data sharing policies, a Data Release Workshop was convened in Toronto in May 2009 by Genome Canada and other funding agencies. The meeting brought together a diverse and international group of scientists, ethicists, lawyers, journal editors, and funding representatives. The goal was to reaffirm and refine, where needed, the policies related to the early release of genomic data, and to extend, if possible, similar data release policies to other types of large biological datasets – whether from proteomics, biobanking or metabolite research.
By design, the Toronto meeting continued policy discussions from previous meetings, in particular the Bermuda meetings (1996, 1997 and 1998)3,4,5 and the 2003 Fort Lauderdale meeting. The latter first recommended that rapid pre-publication release be applied to other datasets whose primary utility was as a resource for the scientific community, and also established the responsibilities of the resource producers, resource users, and the funding agencies6. A similar 2008 Amsterdam meeting extended the principle of rapid data release to proteomics data7. Although the recommendations of these earlier meetings can apply to many genomics and proteomics projects, many scientists outside the major sequencing centers and funding agencies remain unaware of the details of these policies, and so one goal of the Toronto meeting was to reaffirm the existing principles for early data release with a wider group of stakeholders.
In Toronto, attendees endorsed the value of rapid pre-publication data release for large reference datasets in biology and medicine that have broad utility and agreed that pre-publication data release should go beyond genomics and proteomics studies to other datasets – including chemical structure, metabolomic, and RNAi datasets, and annotated clinical resources (cohorts, tissue banks, and case-control studies). In each of these domains, there are diverse data types and study designs, ranging from the large-scale 'community resource projects' first identified at Fort Lauderdale (for which meeting participants endorsed pre-publication data release) to investigator-led hypothesis-testing projects (for which the minimum standard must be the release of generated data at the time of publication).
Several issues discussed at previous data release meetings were not revisited, as they were considered fundamental to all types of data release (whether pre-publication or publication-associated). These included: specified quality standards for all data; database designs that meet the needs of both data producers and users; archiving of raw data in a retrievable form; housing of both 'finished' and 'unfinished' data in databases; and provision of long-term support for databases by funding agencies. New issues that were addressed include the importance of simultaneously releasing metadata (such as environmental/experimental conditions and phenotypes) that will enable users to fully exploit the data, as well as the complexities associated with clinical data owing to concerns about privacy and confidentiality (see 'Sharing data about human subjects').
At a practical level, the Toronto meeting developed a set of suggested 'best practices' for funding agencies, for scientists in their different roles (whether as data producers, data analysts/users, or manuscript reviewers), and for journal editors (see 'The Toronto statement').
Recommendations for funders
Funding agencies should require rapid prepublication data release for projects that generate data sets that have broad utility, are large in scale, are 'reference' in character and typically have community 'buy-in'. The table provides examples of projects using different designs, technologies, and approaches that have several of these attributes, but also lists projects that are more hypothesis-based for which prepublication data release should not be mandated.
It was agreed at the meeting that the requirements for prepublication data release must be made clear when funding opportunities are first announced and that proactive engagement of funders is beneficial throughout a project, as has been the experience of many genome-sequencing efforts, the International HapMap Project, the ENCODE project, the 1000 Genomes project and, more recently, the International Cancer Genome Consortium, the Human Microbiome Project and the MetaHIT project.
For all projects generating large data sets, the Toronto meeting recommended that funding agencies require that data-sharing plans be presented as part of grant applications and that these plans be subjected to peer review. Such practice is currently the exception rather than the rule. Funding agencies will need to exercise flexibility by, for example, recognizing that large-scale data-generation projects need not necessarily lead to traditional publications, and that certain projects may need to release only some of their generated data before publication. At the same time, general consistency in data-sharing policies between funding agencies is desirable, whenever possible. To encourage compliance, funding agencies and academic institutions should give credit to investigators who adopt prepublication data-release practices. One option would be to recognize good data-release behaviour during grant renewals and promotion processes; another would be to track the usage and citation of data sets using electronic systems similar to those used for traditional publications8.
Data producers and data users
Early data release can lead to tensions between the interests of the data-producing scientists who request the right to publish a first description of a data set and other scientists who wish to publish their own analyses of the same data. To date, many papers have been published by third parties reporting research findings enabled by data sets released before publication. The experiences shared in Toronto suggest that these have rarely affected subsequent publications authored by the data producers. Nevertheless, the Toronto meeting participants recognized that this is an ongoing concern that is best addressed by fostering a scientific culture that encourages transparent and explicit cooperation on the part of data producers, data analysts, reviewers and journal editors.
Data producers should, as early as possible, and ideally before large-scale data generation begins, clarify their overall intentions for data analysis by providing a citable statement, typically a 'marker paper', that would be associated with their database entries. This statement should provide clear details about the data set to be produced, the associated metadata, the experimental design, pilot data, data standards, security, quality-control procedures, expected timelines, data-release mechanisms and contact details for lead investigators. If data producers request a protected time period to allow them to be the first to publish the data set, this should be limited to global analyses of the data and ideally expire within one year.
If the citable statement is a 'marker paper', it should be subjected to peer review and published in a scientific journal. Alternatively, other citable sources, such as digital object identifiers to specific pages on well-maintained funding agency or institutional websites, could also be used. Data producers benefit from creating a citable reference, as it can later be used to reflect the impact of the data sets8.
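As an illustration only, the contents of such a citable statement could be captured in a simple machine-readable record. The sketch below is not a format proposed at the Toronto meeting: the class name `MarkerStatement`, its field names, and the helper `protection_within_one_year` are all hypothetical. It also assumes that the one-year limit on a protected period is counted from the date of first data release, which the recommendations do not specify.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class MarkerStatement:
    """Hypothetical record of the details a citable statement should provide."""
    dataset_description: str
    associated_metadata: str
    experimental_design: str
    data_standards: str
    quality_control: str
    release_mechanism: str
    lead_contact: str
    first_release: date                     # assumed reference date for the protected period
    protected_until: Optional[date] = None  # None means no protected period was requested

    def protection_within_one_year(self) -> bool:
        """Check the protected period against the suggested one-year limit.

        Assumption: the year is counted from first_release (365 days).
        """
        if self.protected_until is None:
            return True
        return self.protected_until <= self.first_release + timedelta(days=365)


# Example with made-up values.
statement = MarkerStatement(
    dataset_description="Reference variation data set (illustrative)",
    associated_metadata="Sample phenotypes and experimental conditions",
    experimental_design="Cohort sequencing",
    data_standards="Community file formats",
    quality_control="Per-sample quality filters",
    release_mechanism="Public database deposition",
    lead_contact="lead.investigator@example.org",
    first_release=date(2009, 5, 1),
    protected_until=date(2010, 3, 1),
)
print(statement.protection_within_one_year())
```

A record like this would sit alongside the database entries it describes; the peer-reviewed marker paper or other citable source would remain the authoritative statement.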
In turn, the data users should carefully read the source information, including any marker papers, associated with a released data set. Data analysts should pay particular attention to any caveats about data quality, because rapidly released data are often unstable, in that they may not yet have been subjected to full quality control and so may change. It would be prudent for data analysts to assess the benefits and potential problems in immediately analysing released data. They should communicate with data producers to clarify issues of data quality in relation to the intended analyses, whenever possible. In addition, data users should be aware that some data sets are associated with version numbers: the appropriate version number should be tracked and then provided in any published analyses of those data.
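One lightweight way for an analyst to follow the version-tracking advice is to record the data set name, version and citable source at download time and reuse that record when writing up. The following is a minimal sketch under that assumption; the names `DatasetUse` and `data_source_line`, and all example values, are hypothetical and do not correspond to any particular database's conventions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetUse:
    """Hypothetical provenance record kept by a data user."""
    name: str             # data set name as given by the producer
    version: str          # version number of the release that was analysed
    accessed: str         # date the data were retrieved (ISO format)
    source_citation: str  # marker paper or other citable source


def data_source_line(use: DatasetUse) -> str:
    """Build a sentence stating the version and source for a published analysis."""
    return (
        f"Analyses used {use.name}, version {use.version} "
        f"(accessed {use.accessed}); data source: {use.source_citation}."
    )


# Example with made-up values.
use = DatasetUse(
    name="Example Reference Panel",
    version="2.1",
    accessed="2009-06-15",
    source_citation="Example Consortium marker paper",
)
print(data_source_line(use))
```

Making the record immutable (`frozen=True`) is a design choice meant to prevent the recorded version from being altered after the analysis has been run.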
Resulting papers describing studies that do not overlap with the intentions stated by the data producers in the marker paper (or other citable source) may be submitted for publication at any time, but must appropriately cite the data source. Papers describing studies that do overlap with the data producer's proposed analyses should be handled carefully and respectfully, ideally including a dialogue with the data producer to see if a mutually agreeable publication schedule (such as co-publication or inclusion within a set of companion papers) can be developed. In this regard, it is important for data users to realize that, historically, many such dialogues have led to coordinated publications and to new scientific insights. Despite the best intentions of all parties, on occasion a researcher may publish the results of analyses that overlap with the planned studies of the data producer. Although such instances are hopefully rare if good communication protocols are followed, these should be viewed as a small risk to the data producers, one that comes with the much greater overall benefit of early data release.
Editors and reviewers
As reviewers of manuscripts submitted for publication, scientists should be mindful that prepublication data sets are likely to have been released before extensive quality control is performed, and any unnoticed errors may cause problems in the analyses performed by third parties. Where the use of prepublication data is limited or not crucial to a study's conclusions, the reviewers should only expect the normal scientific practice of clear citation and interpretation. However, when the main conclusions of a study rely on a prepublication data set, reviewers should be satisfied that the quality of the data is described and taken into account in the analysis.
Participants at the Toronto meeting recommended that journals play an active part in the dialogue about rapid prepublication data release (both in their formal guide to authors and informal instructions to reviewers). Journal editors should remind reviewers that large-scale data sets may be subject to specific policies regarding how to cite and use the data. Ultimately, journal editors must rely on their reviewers' recommendations for reaching decisions about publication. However, encouraging reviewers to carefully check the conditions for using data that authors have not created themselves can help to raise both the quality of analysis and fairness in citation of published studies.
The rapid prepublication release of sequencing data has served the field of genomics well. The Toronto meeting participants acknowledged that policies for prepublication release of data need to evolve with the changing research landscape, that there is a range of opinion in the scientific community, and that actual community behaviour (as opposed to intentions) needs to be reviewed on a regular basis. To this end, we encourage readers to join the debate over data-sharing principles and practice in an online forum hosted at http://tinyurl.com/lqxpg3.
See online special at http://tinyurl.com/dataspecial