Identifying transcription-factor binding sites (TFBS) is essential for deciphering gene regulatory networks. However, the complexity of tissue-specific gene regulation makes it difficult to identify DNA-binding sites for unknown regulatory factors, particularly in vertebrates. Binding-site motifs that are involved in tissue-specific gene expression are common among the promoters of genes that are expressed in the same tissue, but not among promoters that control gene expression in other tissues. Michael Zhang and colleagues have developed a computational method that exploits this difference: it searches for highly degenerate TFBS motifs (and motif combinations) that are overrepresented in the promoters of tissue-specific genes, relative to the promoters of genes that are not expressed in that tissue.

Degenerate motifs cannot be described adequately by a consensus sequence, so they are described instead by a scoring matrix, which indicates how often each nucleotide is found at each position within the motif. The researchers used an approach they called DME (discriminating matrix enumerator) to test each possible matrix in turn and rank it according to how well it discriminates one set of promoters from another, that is, by how much the motif is overrepresented in one set of promoters relative to the other.
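To make the idea of a scoring matrix and a discrimination score concrete, here is a minimal Python sketch. It is not the DME algorithm itself: DME enumerates candidate matrices, whereas this sketch only builds one matrix from a handful of aligned sites and measures how much more often it matches a foreground set than a background set. The function names, the pseudocount, the uniform base frequency of 0.25, the score threshold and all of the sequences are illustrative assumptions, not values taken from the study.

```python
import math

BASES = "ACGT"

def frequency_matrix(sites, pseudocount=1.0):
    """Per-position nucleotide frequencies for equal-length aligned sites.

    The pseudocount (an assumption of this sketch) avoids zero frequencies.
    """
    width = len(sites[0])
    matrix = []
    for pos in range(width):
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        matrix.append({base: counts[base] / total for base in BASES})
    return matrix

def log_odds_score(matrix, window, base_freq=0.25):
    """Sum of log2(frequency / background frequency) over the window."""
    return sum(math.log2(col[base] / base_freq)
               for col, base in zip(matrix, window))

def best_score(matrix, promoter):
    """Highest-scoring window of the matrix width anywhere in the promoter."""
    width = len(matrix)
    return max(log_odds_score(matrix, promoter[i:i + width])
               for i in range(len(promoter) - width + 1))

def hit_fraction(matrix, promoters, threshold):
    """Fraction of promoters with at least one window at or above threshold."""
    hits = sum(1 for p in promoters if best_score(matrix, p) >= threshold)
    return hits / len(promoters)

def discrimination(matrix, foreground, background, threshold):
    """Simple overrepresentation measure: foreground minus background hit fraction."""
    return (hit_fraction(matrix, foreground, threshold)
            - hit_fraction(matrix, background, threshold))

# Toy data, invented purely for illustration.
sites = ["TGACTA", "TGACTC", "TGAGTA", "TGACTA"]
foreground = ["AAGTGACTAGGC", "CCTGACTCATTA", "GGGATGAGTACC"]
background = ["AAGCCGTTAGGC", "CCATTGGCATTA", "TGACTAGGGCCC"]

matrix = frequency_matrix(sites)
score = discrimination(matrix, foreground, background, threshold=5.0)
print(f"discrimination score: {score:.3f}")
```

In a fuller treatment, many candidate matrices would each be scored this way and sorted by the result, so that the matrices that best separate the two promoter sets rise to the top of the ranking.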

Zhang and colleagues searched the promoter sequences of vertebrate liver-specific genes. They compared a 'foreground' promoter set, the liver-selective promoter set (LSPS) of non-homologous promoters, with a 'background' set: the vertebrate subset of promoters from the Eukaryotic Promoter Database (EPD), from which the promoters associated with liver had been removed. Reassuringly, many of the most overrepresented motifs they recovered were remarkably similar to those already known to bind well-characterized liver-specific transcription factors. Likewise, when they searched for muscle-specific motifs, they found several that were similar to well-known muscle-specific TFBS.

The authors conclude that their method can accurately identify, or give a better description of, known TFBS, and can also identify previously uncharacterized motifs. However, they note that the choice of background set needs to be guided carefully by the hypothesis being tested, because the sequence properties of the chosen background set will influence the types of motif picked up in the analysis. Nonetheless, they argue that there are now sufficient sequence and expression data available for large-scale computational studies of tissue-specific TFBS, and that DME is sufficiently accurate to be used in such efforts.