We have found that ongoing financial investment in data-archiving infrastructure yields an impressive scientific return, and believe that it should be whole-heartedly supported by research funding agencies (see, for example, http://go.nature.com/nzftf3).
We used Dryad (see http://datadryad.org), an international, open, cost-effective data repository for the biological sciences, to estimate the cost of archiving data from more than 10,000 publications. We found that these could be curated and the data preserved at an annual cost of about US$400,000.
As an example of how much research is typically published per grant dollar, core grants in population and community ecology from the US National Science Foundation averaged 3–4 publications per $100,000 of grant between 2000 and 2005 (S. Reyes, A. Tessier and S. Mazer, unpublished results). That is, $400,000 invested in original research resulted in about 12–16 papers.
Dryad cannot yet tell us how effective data archives are in facilitating primary research publications, but the Gene Expression Omnibus (GEO) database at the US National Center for Biotechnology Information offers some insight. To estimate data reuse, we searched the full text of articles in PubMed Central for mention of any of the 2,711 data sets deposited in GEO in 2007. We excluded articles whose authors' names overlapped with those depositing the data set. Extrapolating the 338 hits in PubMed Central to all of PubMed, we estimate that the GEO 2007 data sets made third-party contributions to more than 1,150 published articles by the end of 2010, and reuse continues to accumulate rapidly (see Dryad Digital Repository, doi:10.5061/dryad.j1fd7; 2011).
Assuming that Dryad has a comparable rate of reuse and collects at least 2,500 data sets annually, an investment of $400,000 in one year should contribute to more than 1,000 papers in the next four years, far more than the same sum typically yields when spent directly on original research.
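The comparison above can be checked with a few lines of arithmetic. The sketch below is only an illustration using the figures quoted in this letter; the function and variable names are our own, and it assumes that reuse scales linearly with the number of data sets deposited, which is a simplification.

```python
# Back-of-envelope check of the figures quoted in the text.
# All names are illustrative; the linear-scaling assumption is ours.

ANNUAL_ARCHIVE_COST = 400_000      # US$ per year to curate and preserve data
GRANT_PAPERS_PER_100K = (3, 4)     # publications per $100,000 of grant funding
GEO_DATASETS_2007 = 2_711          # data sets deposited in GEO in 2007
GEO_REUSE_PAPERS = 1_150           # third-party articles by end of 2010 (lower bound)
DRYAD_DATASETS_PER_YEAR = 2_500    # assumed minimum annual Dryad intake


def papers_from_grants(budget, papers_per_100k):
    """Papers expected when the budget funds original research."""
    return budget / 100_000 * papers_per_100k


def papers_from_archive(datasets, reuse_papers_per_dataset):
    """Papers expected to reuse archived data, assuming linear scaling."""
    return datasets * reuse_papers_per_dataset


# Reuse rate observed for GEO: articles per deposited data set over about four years.
reuse_rate = GEO_REUSE_PAPERS / GEO_DATASETS_2007

grant_low = papers_from_grants(ANNUAL_ARCHIVE_COST, GRANT_PAPERS_PER_100K[0])
grant_high = papers_from_grants(ANNUAL_ARCHIVE_COST, GRANT_PAPERS_PER_100K[1])
archive_estimate = papers_from_archive(DRYAD_DATASETS_PER_YEAR, reuse_rate)

print(f"Papers from $400,000 of grants: {grant_low:.0f} to {grant_high:.0f}")
print(f"Reuse rate per data set: {reuse_rate:.2f}")
print(f"Papers from $400,000 of archiving: about {archive_estimate:.0f}")
```

Under these assumptions the archive estimate comes to roughly 1,060 papers, consistent with the "more than 1,000" figure, against 12 to 16 papers from the same sum spent on grants.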