Identifying co-regulated genes and their
transcription-factor binding sites (TFBS) is a key step toward understanding
transcription regulation. In addition to effective laboratory assays, various
computational approaches for the detection of TFBS in promoter regions of
coexpressed genes have been developed. The availability of complete genome
sequences combined with the likelihood that transcription factors and their
cognate sites are often conserved during evolution has led to the development
of phylogenetic footprinting [1,2]. The modus operandi of
this technique is to search for conserved motifs upstream of orthologous genes
from closely related species [1,2]. The method can identify
hundreds of TFBS without prior knowledge of co-regulation or coexpression.
Because many of these predicted sites are likely to be bound by the same
transcription factor, motifs with similar patterns can be put into clusters so
as to infer the sets of co-regulated genes, that is, the regulons. This
strategy utilizes only genome sequence information and is complementary to and
confirmative of gene expression data generated by microarray experiments.
However, the limited data available to characterize individual binding
patterns, the variation in motif alignment, motif width, and base conservation,
and the lack of knowledge of the number and sizes of regulons make this
inference problem difficult. We have developed a Gibbs sampling-based [3]
Bayesian motif clustering (BMC) algorithm to address these
challenges. Tests on simulated data sets show that BMC produces many fewer
errors than hierarchical and K-means clustering methods [4]. The
application of BMC to hundreds of predicted γ-proteobacterial motifs [2]
correctly identified many experimentally reported regulons, inferred
the existence of previously unreported members of these regulons, and suggested
novel regulons.
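
To make the idea of grouping motifs with similar patterns concrete, the following is a minimal sketch of one possible Bayesian similarity score between two motifs. It is not the BMC algorithm itself. It assumes that each motif is given as a list of per-position base counts (A, C, G, T), that the two motifs are already aligned and of equal width (precisely the variation that makes the real problem hard), and that each position follows a multinomial model with a symmetric Dirichlet prior. The function names and the prior parameter `alpha` are hypothetical and chosen for illustration only.

```python
from math import lgamma


def log_marginal(counts, alpha=1.0):
    """Log marginal likelihood of one motif column of base counts under a
    multinomial model with a symmetric Dirichlet(alpha) prior."""
    k = len(counts)
    n = sum(counts)
    total = lgamma(k * alpha) - lgamma(k * alpha + n)
    for c in counts:
        total += lgamma(alpha + c) - lgamma(alpha)
    return total


def log_bayes_factor(motif_a, motif_b, alpha=1.0):
    """Log Bayes factor for 'both motifs share one pattern' versus
    'each motif has its own pattern', summed over aligned columns.
    Positive values favor placing the motifs in the same cluster."""
    if len(motif_a) != len(motif_b):
        raise ValueError("this sketch assumes aligned motifs of equal width")
    score = 0.0
    for col_a, col_b in zip(motif_a, motif_b):
        merged = [x + y for x, y in zip(col_a, col_b)]
        score += (log_marginal(merged, alpha)
                  - log_marginal(col_a, alpha)
                  - log_marginal(col_b, alpha))
    return score


# Toy count matrices (rows are positions, columns are A, C, G, T counts).
motif_x = [[10, 0, 0, 0], [0, 10, 0, 0], [0, 0, 9, 1]]
motif_y = [[9, 1, 0, 0], [0, 10, 0, 0], [0, 0, 10, 0]]   # resembles motif_x
motif_z = [[0, 0, 0, 10], [0, 0, 10, 0], [10, 0, 0, 0]]  # unrelated pattern

similar_score = log_bayes_factor(motif_x, motif_y)
dissimilar_score = log_bayes_factor(motif_x, motif_z)
```

In this toy example `similar_score` is positive and `dissimilar_score` is negative. A sampling-based clustering procedure could use a score of this kind when deciding which cluster a motif should join, but the actual method must also account for uncertainty in motif alignment and width, which this sketch ignores.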