The advent of whole-genome approaches, microarray technologies and improved computation has given us new insights into the regulation of gene expression, although relating the expression of regulatory genes to that of the genes under their control remains difficult. Beer and Tavazoie have recently reported a systematic approach to the problem, based on identifying regulatory sequence elements in the 5′-upstream DNA of the genes of interest.

The relationship between mRNA levels of a transcription factor and the gene it regulates might not be direct owing to, for example, post-transcriptional regulation. But correlating the abundance of the active transcription-factor protein to the mRNA level of the gene it regulates is technically challenging. Fortunately, the new method of Beer and Tavazoie sidesteps these difficulties by building sequence-to-gene networks, using short 5′-upstream DNA sequence elements as a surrogate for active transcription-factor protein levels.

Working in yeast, the authors began by looking for groups of genes that were coexpressed under a range of experimental conditions (for example, heat shock or diamide treatment). In all, they assigned 2,587 genes to 49 'expression patterns'. Next, they identified overrepresented sequence motifs within the 800 bp upstream of the genes in each pattern. The rationale was that such sequence elements were likely to be involved in the regulation of the corresponding genes. Indeed, many of the motifs that were pinpointed closely matched known regulatory elements. The authors used a Bayesian approach to apply further constraints to the motifs, such as orientation and distance from the ATG start codon, so that regulatory 'rules' could be inferred. They also took into account combinations of motifs. For example, the two elements PAC and RRPE were both found upstream of a high proportion of genes in a particular expression pattern, indicating that they coregulate genes in this group. It emerged that the order of the two elements, and the distance between them, also strongly affected how tightly the expression of these genes was correlated.
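
The motif-overrepresentation step can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a simple exact-match search on one strand and a hypergeometric test, and the function names, the input layout (a dictionary of upstream sequences that each end at the ATG) and the example motif are hypothetical choices made for illustration.

```python
from math import comb


def hypergeom_tail(k, K, n, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the motif."""
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def motif_enrichment(motif, upstream, pattern_genes, window=800):
    """Count genes in one expression pattern that carry `motif` upstream.

    upstream      -- dict: gene name -> upstream sequence ending at the ATG
                     (assumed layout, not taken from the paper)
    pattern_genes -- set of gene names assigned to one expression pattern
    window        -- number of bases upstream of the ATG to search
    Returns (genes in the pattern with the motif, enrichment p-value).
    """
    # Exact-match, forward-strand search only: a deliberate simplification.
    carriers = {g for g, seq in upstream.items() if motif in seq[-window:]}
    k = len(carriers & pattern_genes)
    p = hypergeom_tail(k, len(carriers), len(pattern_genes), len(upstream))
    return k, p


# Toy usage with made-up sequences and a made-up motif.
toy_upstream = {
    "g1": "AAGCGATGAGTT",
    "g2": "CCGCGATGAGAA",
    "g3": "TTTTTTTTTTTT",
    "g4": "ACACACACACAC",
}
count, p_value = motif_enrichment("GCGATGAG", toy_upstream, {"g1", "g2"})
```

A small p-value means the motif occurs upstream of more genes in the pattern than would be expected by chance given its genome-wide frequency. Constraints such as orientation, distance from the ATG and motif combinations, which the authors handled with a Bayesian approach, are outside the scope of this sketch.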

So, having identified the upstream sequence elements that are involved in transcriptional regulation, along with the positional and combinatorial constraints that govern their role, the authors tested the predictive power of their approach. Based on promoter sequences alone, they attempted to predict the expression patterns of 'test' sets of genes that had not been used while learning the rules. Impressively, their predictions were accurate for 73% of the genes, and fine-tuning of the system is likely to improve on this.
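
The idea of scoring predictions on genes held back from rule learning can be sketched as follows. This is a generic hold-out evaluation under stated assumptions, not the authors' procedure: the split fraction, the function names and the single-motif 'rule' used as a stand-in learner are all hypothetical.

```python
import random


def holdout_accuracy(examples, learn, test_fraction=0.2, seed=0):
    """Fraction of held-out genes whose expression pattern is predicted correctly.

    examples -- list of (promoter_sequence, pattern_label) pairs
    learn    -- callable: training examples -> predictor (sequence -> label)
    """
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    test, train = shuffled[:n_test], shuffled[n_test:]
    predictor = learn(train)  # rules are learned from training genes only
    correct = sum(predictor(seq) == label for seq, label in test)
    return correct / len(test)


def learn_single_motif_rule(train, motif="GCGATGAG"):
    """Toy stand-in learner: predict the most common label seen with/without a motif."""
    def most_common(labels):
        return max(set(labels), key=labels.count) if labels else None

    with_motif = most_common([lab for seq, lab in train if motif in seq])
    without_motif = most_common([lab for seq, lab in train if motif not in seq])
    return lambda seq: with_motif if motif in seq else without_motif


# Toy usage with made-up promoters in which the motif fully determines the label.
toy_examples = [("AAGCGATGAGTT", "pattern_A")] * 5 + [("TTTTTTTTTTTT", "pattern_B")] * 5
toy_accuracy = holdout_accuracy(toy_examples, learn_single_motif_rule)
```

Because the test genes never reach the learner, the resulting accuracy estimates how well the inferred rules generalize to unseen promoters rather than how well they fit the training set.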

As high-quality mRNA expression data become readily available for different organisms, the approach reported here will be invaluable to our understanding of the regulation of genes, and of cellular behaviour more generally. The authors have already begun to apply their new approach to multicellular organisms, starting with Caenorhabditis elegans. In a preliminary study using expression data collected over a time course from oocyte to adult, they were able to predict the expression patterns of half the genes. As these studies are expanded to take into account further complexities, such as downstream or intronic regulatory elements, it should become possible to unravel the transcriptional regulatory mechanisms behind diverse spatiotemporal processes.