The NIH has a unique opportunity—as research sponsor, as steward of the peer-review process for awarding research funding, and as the major public library for access to research results. —Data and Informatics Working Group, Draft Report to the Advisory Committee of the Director (http://acd.od.nih.gov/diwg.htm)

We agree with this assessment in last year's Data and Informatics Working Group (DIWG) report to the NIH director's advisory committee. Grasping this opportunity will require at least a central registry of research and clinical data sets and a minimal metadata schema describing the conditions under which they were produced. Taking the opportunity further could entail central public storage of all publicly funded research data and analytical tools, but here we disagree: that is a recipe for burying biomedically useful insights in noise. Signals can instead be filtered from the noise by enterprising individuals attracted by the opportunity to use data and to design solutions for its management, with such use governed by NIH-set standards that invite innovation.

We want the NIH to fund research, set a minimum standard for the registration of metadata, clearly state the rules for access to data from the large-resource projects it funds by including data management plans in the peer review of resource grants, and fund training in data analysis. If it does anything else to promote research on large data sets, it should not aspire to become the sole centralized provider but should instead provide application programming interfaces to stimulate problem solving by the user community of researchers, institutions and private-sector individuals attracted by competition to provide solutions on an open and interoperable set of standards.

We suggest that it might be productive to leave open—for competition and tender—the services needed to support peer review of data quality and publication of metadata (for example, we offer NPG's Scientific Data service), tool development, training course design and delivery, and data metrics incentivizing data deposition and access. In the era of cloud computing, even data curation, storage and security can be open for contract.

There are two competing views on whether big data by itself changes the practice of science.

The new availability of huge amounts of data, along with the statistical tools to crunch these numbers, offers a whole new way of understanding the world. Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all. —C. Anderson, Wired (June 23, 2008)

Chris Anderson, the editor of Wired magazine, wrote in 2008 that the sheer volume of data would obviate the need for theory, and even the scientific method. [...] But... these views are badly mistaken. The numbers have no way of speaking for themselves. We speak for them. We imbue them with meaning. [...] Big Data will produce progress—eventually. How quickly it does, and whether we regress in the meantime, will depend on us. [...] Our biological instincts are not always very well adapted to the information-rich modern world. Unless we work actively to become aware of the biases we introduce, the returns to additional information may be minimal—or diminishing. [...] Meanwhile, if the quantity of information is increasing by 2.5 quintillion bytes per day, the amount of useful information almost certainly isn't. Most of it is just noise, and the noise is increasing faster than the signal. —Nate Silver, The Signal and the Noise (Penguin Press, New York, 2012)

The greatest incentive to make scientific discoveries from data is control. When you have a period of exclusive access to a data set, you can make efficient resource deployment and analytical decisions to search for signal in the noise and test hypotheses in a prioritized way. When you can choose research groups with which to collaborate by their favorable signal-to-noise ratios, you can use their data to replicate results obtained from your own data sets and thereby guard against models overfit to your own data. Having to strike deals of mutual benefit among producers of data of similar quality and quantity has the advantage of forcing a direct discussion of standards, priorities and transparency among peers.

Conversely, when you are overwhelmed by noise and analytically unprepared for the data deluge, the instinct is to retain exclusive access to the data set, whether or not it is useful or productive to do so. The fear of missing an important insight may delay data sharing beyond the useful lifetime of the data set.

As most hypotheses and observations are unpublishable without the ability to test and replicate them outside the originating data set, a delay in sharing is also a lost opportunity to trade with others for a complementary data set. Both problems can be addressed by proposals in the DIWG report: clear rules on sharing data sets from publicly funded projects, tools to navigate large data sets, and appropriate training in bioinformatics, statistics and other quantitative methods.