To the Editor:

One perhaps unintended consequence of the success of the Human Genome Project has been a shift in the biomedical research funding landscape toward large-scale programs, commonly involving several hundred scientists and budgets of hundreds of millions of dollars. However, this emphasis on large-scale projects has been questioned, as illustrated by recent debates following last year's publications from the Encyclopedia of DNA Elements (ENCODE) project1,2. Large-scale projects must decide ahead of time which data sets should be generated for a given research community. We have explored an alternative approach: compiling all data sets produced by one such community as soon as they have been deposited in public databases. We demonstrate that the compendium resulting from such real-time curation can exceed in size those of large-consortium efforts, a finding that contributes directly to the ongoing 'small science versus big science' debate.

We created HAEMCODE, a repository for transcription factor (TF)-binding maps in mouse blood cells; the maps are generated from chromatin immunoprecipitation followed by sequencing (ChIP-seq) data. We manually curated more than 300 TF ChIP-seq studies from a wide range of primary mouse hematopoietic cells and major cell line models and processed them with a standardized analysis pipeline. As of September 2013, the HAEMCODE compendium covered 84 TFs across 24 major blood cell types. Hematopoiesis is also a major focus of ENCODE, yet the currently available mouse ENCODE data (36 TFs; May 2013) cover less than half the HAEMCODE contents, with only 9 TFs investigated by ENCODE not available elsewhere.

We developed a Web interface (http://haemcode.stemcells.cam.ac.uk/) to provide data access as well as a range of online analysis tools that we designed to be useful to both experimentalists and computational biologists. In the classical use case, a user selects experiments within HAEMCODE and is then directed to a workspace that offers precomputed options to inspect or download the selected ChIP-seq data sets. Additional online tools can compute global similarity between selected experiments, investigate overrepresentation of a user-submitted gene list in any subset of ChIP-seq experiments3, inspect precomputed results from de novo motif discovery, and output all ChIP-seq experiments with binding peaks for a user-supplied gene locus.
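
Two of the analyses named above can be illustrated with a minimal Python sketch. It is not the HAEMCODE implementation: the letter does not state which statistical test the gene-list tool uses, so a one-sided hypergeometric test is assumed here, and all function names, experiment labels, gene identifiers and coordinates are hypothetical.

```python
from math import comb


def hypergeom_sf(k, M, n, N):
    """P(X >= k) when N genes are drawn from M, of which n are bound."""
    upper = min(n, N)
    total = comb(M, N)
    return sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1)) / total


def enrichment_p(user_genes, bound_genes, background):
    """Overrepresentation of a user gene list among genes bound in one experiment.

    Assumption: a one-sided hypergeometric test against a fixed background.
    """
    background = set(background)
    user = set(user_genes) & background
    bound = set(bound_genes) & background
    return hypergeom_sf(len(user & bound), len(background), len(bound), len(user))


def overlaps(peak, locus):
    """True if two half-open intervals (chrom, start, end) share at least one base."""
    chrom, start, end = peak
    l_chrom, l_start, l_end = locus
    return chrom == l_chrom and start < l_end and l_start < end


def experiments_with_peaks(experiments, locus):
    """Names of experiments that have at least one peak overlapping the locus."""
    return sorted(
        name
        for name, peaks in experiments.items()
        if any(overlaps(peak, locus) for peak in peaks)
    )


# Hypothetical toy data, for illustration only.
experiments = {
    "TF_A_celltype1": [("chr1", 100, 200), ("chr2", 50, 80)],
    "TF_B_celltype2": [("chr1", 500, 600)],
}
hits = experiments_with_peaks(experiments, ("chr1", 150, 300))

background = [f"g{i}" for i in range(10)]
p_value = enrichment_p(["g0", "g1", "g9"], ["g0", "g1", "g2", "g3"], background)
```

With this toy input, `hits` contains only the first experiment, and `p_value` is the probability of seeing at least two bound genes among three drawn from a background of ten in which four are bound. A real analysis would also need a defined rule for assigning peaks to genes and a correction for testing many experiments at once, neither of which is specified in the text.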

Integration of publicly available data is a powerful approach for making novel discoveries across diseases, species and platforms that would be impossible to achieve from single projects4. Successful completion of the HAEMCODE project on a small budget highlights this approach as a potentially widely applicable complement to multimillion-dollar research initiatives.