In molecular biology and genetics, high-throughput data gathering has progressed in a short time from a novel strategy of the few to a standard approach of the many.

Many of these data are not collected in a way that guarantees their accessibility to other researchers for future analysis. The best databases, however, are online agglomerations of raw and processed data, analytic tools and insights that have been collectively dubbed 'knowledge environments'.

Funding agencies are slowly adjusting their priorities to reflect the importance of consistent, assured support for some of these databases. The US National Institutes of Health (NIH), for example, has recognized that it must provide guidance for its research sections and grantees so as to maximize the return, in both knowledge and public health, on its investments in biomedical research.

Which strategies best support the collection, analysis and dissemination of large databases of related information? Do knowledge environments promote the aims of the funding agencies? At a meeting in Bethesda, Maryland, last month, it was clear that the NIH is struggling to find a middle road between two diametrically opposed approaches to the development of such databases. Top-down pressure by the agency on researchers to use certain software or formats would probably impede their development. But a bottom-up strategy that merely encourages cross-project cooperation, while allowing researchers total freedom to devise their own databases, is bound to be chaotic, does not guarantee cross-compatibility of data, and ultimately reduces the likelihood that their contents will be used to maximally benefit research.

What is clear is that individual labs can no longer make much progress alone. Currently, many researchers feel they are drowning in data. For all they know, a database might contain answers to patient safety issues or glimmers of new therapeutics, but such insights are being lost through an inability to effectively harvest the data already available. Other opportunities are missed because both experts and data are 'silo'-ed in isolated and often inaccessible systems. On top of these issues is the fact that neither databases nor the experts who create them are permanent or inseparable.

The NIH and its equivalent agencies elsewhere in the world are now turning their attention to working out how best to assist the growth of validated and accessible databases. This should involve, at the least, development of policies for evaluating proposals on databases and associated analytic tools, for their sustained funding, and for ensuring that the data deposited remain accessible long after the project originators have moved on.

The NIH itself, if it chose, could aim for something grander. It could take it upon itself to define a broad reference model and the basic architecture for knowledge environments. It could even build a centralized warehouse with Google-like storage, a veritable National Biomedical Resource of raw data and the tools to access and analyse them. The challenge then would be how to store, retrieve and read data created on multiple systems by diverse research groups.

But perhaps it will prove more realistic for the US agency to concentrate on improving the inter-operability of databases, rather than pushing for their merger, and to provide incentives for building in 'joins' from the start. The NIH should work on this with industrial companies and other government agencies.

All funders should also be aware of the need to support viable career paths for the software engineers and bioinformaticians who create the knowledge environments and curate the data in them. And in order to obtain value for money, it will be vital for funding agencies to carefully select the databases they choose to support and then to support them for the long term. They must encourage the sustained availability of these data and build incentives for the development of cross-querying capability.

As things stand, even popular databases such as PhysioNet, which is facing loss of NIH funding after seven years in development, often lack secure funding. In order to get value, the NIH and other funding agencies need to develop policies for such knowledge environments that will allow long-term support of successful ones.

Researchers also need stronger incentives to sustain their own participation in building knowledge environments. At a minimum, contributors should receive a citable acknowledgement of depositions. Leadership and trust are required to ensure that primary researchers personally benefit from storing their data in open databases.

Michael Huerta, associate director of the National Institute of Mental Health, told those at the Bethesda meeting that the NIH is keen to get more mileage out of high-throughput data acquired with public funds. But that will mean requiring the databases the agency supports to be equipped with tools and protocols that researchers find convenient to use. Ultimately, these databases must show that their data are being used by other groups and are generating work of publishable quality.