Publication of ENCODE data drives innovation in data mining.
There can be few scientists who have not used a brightly coloured highlighter pen to mark the most interesting parts of a research paper, report, proposal or (librarians look away) book. It is a natural reaction when faced with a swamp of information — to build islands of focus that can be identified and linked, both in print and in the mind.
This week, Nature introduces a new concept in the publishing and dissemination of scientific information: one that is a response to the increasing complexity of modern research, and one that draws heavily on the contribution of the humble highlighter.
Starting on page 45, we publish a package of material that centres on the results from the ENCODE consortium, including 6 of the 30 papers the project has produced. The ENCODE — Encyclopedia of DNA Elements — consortium set out to describe all the functional elements in the human genome. Their headline conclusion: more than 80% of the human genome's components have now been assigned at least one biochemical function.
The six papers that Nature publishes (the others appear simultaneously in Genome Research and Genome Biology) may look like conventional research reports, but in the digital world they begin to take on new form — as themed threads. If you are reading this online, then click on the link. If you are reading it in print, then have a look at the version on Nature's ENCODE explorer website (www.nature.com/encode) or, better still, the iPad app.
As part of the publication process, the ENCODE authors asked for something extra: to select and package together the sections from each paper that will be of particular interest to scientists in various and varied fields. Just as a postdoctoral researcher looking at transcription factors would use a highlighter to mark up different bits of the papers from, say, a colleague looking at DNA methylation, so the ENCODE authors thought that researchers across the biological spectrum would want to be able to pull together pieces from each of the digital versions that were of specific interest to them. Our editors agreed, and the result is 13 online threads — biological themes that contain no original material but instead harvest and combine related paragraphs, figures and tables from the 30 papers.
The threads, we hope, will help readers to make sense of the dizzying amounts of data produced during the five years of the main ENCODE effort. And they should allow scientists to exploit more easily the information in their own studies, and that, after all, was the point of the project in the first place. Presented online, the threads are filled with links that allow readers to jump easily from paper to paper, to see where the information comes from and how the data are interconnected.
Alongside the thread concept, the ENCODE package introduces another technical innovation, or at least one that is new to Nature. Using a 'virtual machine', online readers can access software designed to perform set computational functions on some of the ENCODE data.
The idea is to allow readers to recreate the analyses behind specific aspects of the papers and to see how the outcomes change when particular parameters are tweaked. Think of it as a bridge that links the data, the analysis and the relevant description and discussion in the formal papers.
We are eager to hear what readers and users of the material think of these approaches. If they are useful, and early feedback suggests that they will be, then scientists who work on other similarly data-rich and analysis-heavy projects should take note. Results from projects that aim to sequence the human microbiome or different forms of cancer, for example, produce piles of data that could be split along many different themes, and so divided into threads. In many cases the true hard work — the science — is done. Threads, then, are just a different way to package the results.
Some practical problems remain in applying these ideas more widely. The thread concept depends on cooperation between publishers, as well as open access to the papers and appropriate copyright agreements. And the virtual machine demands well-curated data that are available to all.
Why are there 13 ENCODE threads? Good question. There could have been many more: as many as there are questions raised in the minds of scientists by the mass of information that the project has placed at their disposal. If your particular interest or angle is not already selected and presented as a theme, then apologies; there is always the old-fashioned route: print the papers and attack them with a highlighter.