Since 2007, a consortium of research groups has been studying the genomes of two model organisms, the fruitfly Drosophila melanogaster and the nematode worm Caenorhabditis elegans, in a project called model organism encyclopedia of DNA elements (modENCODE; ref. 1). The latest results from this project were described recently in two papers in Science (refs. 2, 3) and a suite of companion papers in Nature and Genome Research (http://blog.modencode.org/papers). The studies report both massive genome-scale data sets and analytic strategies for data integration. They substantially increase the annotated fractions of the fly and worm genomes and provide a wealth of data for understanding these model organisms and for developing new bioinformatic methods. Here we provide an overview of the data and some perspective from scientists on challenges for the field.
The goal of modENCODE is to catalog sequence-based functional DNA elements in the fly and worm genomes. Such a catalog may be used to study regulatory networks and other emergent properties of the genomes, and, perhaps, to better understand the human genome. The project also seeks to generate experimental reagents for use by the research community.
A summary of the new data sets is presented in Tables 1 and 2. To increase the number of functional genomic regions discovered, the studies analyzed organisms at different developmental stages. For the fly, ~700 data sets were generated from whole embryos, larvae and adult female and male insects as well as from a few cell lines and tissues. For the worm, ~240 data sets covered all major developmental stages along with some mutants, isolated tissues and animals exposed to pathogens. For both organisms, microarrays and sequencing were used to characterize gene expression, the binding sites of transcription factors and other proteins associated with DNA, origins of DNA replication, nucleosome turnover rates, salt-fractionated chromatin, the genomic locations of nucleosomes and the sites of different histone modifications.
Box 1: Fruitfly modENCODE
Box 2: Nematode worm modENCODE
Looking forward, projects similar to modENCODE now seem feasible for studying other organisms and a broad range of biological problems. What will be the major challenges of such projects? Not sequencing, says Jun Wang, executive director of BGI in Shenzhen, China. He estimates that generating the equivalent of the fly modENCODE data set using today's technology could take less than 2 months, including less than a month for library construction and a month for sequencing (although in practice more time may be required if several replicates are necessary). The main technical barrier, he says, will be preparing large numbers of samples from different tissues, developmental stages and conditions.
A major challenge will be data integration, says Tom Gingeras of Cold Spring Harbor Laboratory. For instance, robust integrative approaches are needed that combine genomic, transcriptional, regulatory and epigenomic signals, according to Olga Troyanskaya of Princeton University. Roded Sharan, a computational biologist at Tel Aviv University, agrees, adding that integrative analysis is required to identify an organism's signaling and regulatory pathways and to elucidate how they vary over time and across cell types. He notes that current algorithmic work focuses on analyzing at most a few networks at a time and will have to be scaled up significantly to understand the complex developmental programs of fly or worm. Overall, says Troyanskaya, existing bioinformatic methods are not adequate, and novel ways of conceptualizing problems may be needed. New methods are needed to deal with the heterogeneity of the data types, to correct for technical and experimental biases and to detect biological signals hidden in experimental noise. Moreover, the sheer volume of data will require new approximation algorithms, computational infrastructure and strategies for disseminating the results.
As a result of modENCODE, the catalog of DNA elements has grown larger, but many questions remain unanswered. The translation of annotated genomes into systems-level descriptions of the fly and the worm is a long-term goal. In worm, the number of candidate noncoding RNAs has increased severalfold, up from 1,061 at the start of the project, but the biological roles of these RNAs are not yet clear. Moreover, “pervasive post-transcriptional regulation of gene expression emerges as a theme from the modENCODE data,” says Thomas Sandmann, a fly geneticist at the German Cancer Research Center. In worm, ~22,000 genes were found to generate ~65,000 different transcripts; in fly, 74% of the ~17,000 genes showed at least one transcript isoform that differed from previous annotations. “Frankly,” says Sandmann, “we don't have a good idea what this complexity is good for or how it works.”
Other open questions involve the evolutionary conservation of DNA functional elements. Manolis Kellis, a member of the modENCODE consortium, explains that a “next step is tackling the comparative analysis of fly and worm to each other and to human, to understand the conservation of the regulatory principles learned, and the relevance of our results to the study of human biology and disease.” Finally, ENCODE, a sister project analyzing functional elements in humans, is also in progress; it published results from a pilot project several years ago and is now extending its analysis to the entire human genome.