Applied bioinformatics for the identification of regulatory elements

Key Points

  • Promoter prediction software can succeed for the 50% of genes with CpG islands or for genes with abundant transcript data.

  • Predictions of individual transcription-factor binding sites (TFBSs) are unreliable owing to the promiscuous binding of transcription factors.

  • Comparative genome sequence analysis (phylogenetic footprinting) can eliminate up to 90% of false binding-site predictions; however, true sites are still obscured by the false predictions.

  • Analysis of clusters of TFBSs in cis-regulatory modules can generate reliable predictions of regulatory regions.

  • New methods are emerging to improve the detection of sequences that regulate gene transcription.
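As a rough illustration of the first key point, the sketch below computes two statistics often used to flag CpG-island-like sequence: GC fraction and the observed/expected CpG ratio. The thresholds (length of at least 200 bp, GC fraction above 0.5, ratio above 0.6) are the commonly cited Gardiner-Garden and Frommer criteria. The function names are hypothetical and are not taken from any promoter-prediction package.

```python
def cpg_stats(seq):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA string."""
    s = seq.upper()
    n = len(s)
    c = s.count("C")
    g = s.count("G")
    cpg = s.count("CG")  # number of CpG dinucleotides
    gc_fraction = (c + g) / n if n else 0.0
    # Expected CpG count if C and G were placed independently: (C * G) / N
    expected = (c * g) / n if n else 0.0
    ratio = cpg / expected if expected else 0.0
    return gc_fraction, ratio


def looks_like_cpg_island(seq, min_length=200, min_gc=0.5, min_ratio=0.6):
    """Apply the three thresholds to one candidate region (illustrative only)."""
    gc_fraction, ratio = cpg_stats(seq)
    return len(seq) >= min_length and gc_fraction > min_gc and ratio > min_ratio
```

In practice such a test would be applied in sliding windows along a genomic sequence. A positive window suggests only that a promoter may be nearby, not where the transcription start site lies.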

Abstract

The compilation of multiple metazoan genome sequences and the deluge of large-scale expression data have combined to motivate the maturation of bioinformatics methods for the analysis of sequences that regulate gene transcription. Historically, these bioinformatics methods have been plagued by poor predictive specificity, but new bioinformatics algorithms that accelerate the identification of regulatory regions are drawing disgruntled users back to their keyboards. However, these new approaches and software are not without problems. Here, we introduce the purpose and mechanisms of the leading algorithms, with a particular emphasis on metazoan sequence analysis. We identify key issues that users should take into consideration in interpreting the results and provide an online training example to help researchers who wish to test online tools before taking an independent foray into the bioinformatics of transcription regulation.

Figure 1: Components of transcriptional regulation.

Acknowledgements

W.W.W. is supported by a grant from the Canadian Institutes of Health Research.

Author information

Corresponding author

Correspondence to Wyeth W. Wasserman.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Related links

DATABASES

LocusLink

endo16

TNNC1

FURTHER INFORMATION

Accompanying online exercises

Glossary

ORTHOLOGY

Two sequences are orthologous if they share a common ancestor and are separated by speciation.

PHYLOGENETIC FOOTPRINTING

An approach that seeks to identify conserved regulatory elements by comparing genomic sequences between related species.
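A minimal sketch of the idea, assuming the two orthologous sequences have already been aligned (gaps written as "-"): slide a fixed-width window along the alignment and record the fraction of identical, ungapped columns. The function name, window size and any cut-off applied to the result are illustrative assumptions, not the settings of a particular footprinting tool.

```python
def window_identity(aln_a, aln_b, window=10):
    """Fraction of identical, ungapped columns in each sliding window."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    # 1 where both sequences carry the same nucleotide, 0 for mismatches and gaps
    matches = [
        int(x == y and x != "-")
        for x, y in zip(aln_a.upper(), aln_b.upper())
    ]
    return [
        sum(matches[i:i + window]) / window
        for i in range(len(matches) - window + 1)
    ]
```

Windows whose identity exceeds a chosen cut-off are candidate conserved regions. The appropriate cut-off depends on the evolutionary distance between the species compared.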

MACHINE LEARNING

The ability of a program to learn from experience, that is, to modify its execution on the basis of newly acquired information. In bioinformatics, neural networks and Markov chain Monte Carlo methods are well-known examples.

NEURAL NETWORK

A machine-learning technique that simulates a network of communicating nerve cells.

CAGE

(Cap analysis of gene expression). The high-throughput sequencing of concatemers of DNA tags that are derived from the initial nucleotides at the 5′ end of mRNAs.

SAGE

(Serial analysis of gene expression). A method for quantitative and simultaneous analysis of a large number of transcripts; short sequence tags are isolated, concentrated and cloned; their sequencing reveals a gene-expression pattern that is characteristic of the tissue or cell type from which the tags were isolated.

LOCAL ALIGNMENT

The detection of local similarities between two sequences.

GLOBAL ALIGNMENT

The alignment of two sequences over their full length.

NEEDLEMAN–WUNSCH ALGORITHM

A commonly used algorithm in bioinformatics that produces a global alignment of two sequences. The term 'global' refers to alignments across the entirety of the sequences. The algorithm returns an optimal alignment, in which 'optimal' refers to the highest possible score under a specific scoring system. The algorithm is computationally demanding, restricting its direct application to sequences of modest length.
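The dynamic-programming recurrence behind the algorithm can be sketched as follows. The scoring values (match +1, mismatch -1, linear gap penalty -1) are arbitrary choices for illustration, not a recommended scoring system, and the function name is hypothetical.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for an optimal global alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

The table holds (n + 1) × (m + 1) cells, so time and memory grow with the product of the two sequence lengths. This is why the algorithm is applied directly only to sequences of modest length.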

HIDDEN MARKOV MODEL

(HMM). A probabilistic model for the recognition of patterns in DNA or protein sequences. HMMs represent a system as a set of discrete states and as transitions between those states. Each transition has an associated probability, which can be readily derived from training sets, such as alignments of known examples of a pattern. HMMs are valuable because they enable a search or alignment algorithm to be built on firm probabilistic bases.
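To make the states-and-transitions description concrete, here is a toy two-state HMM decoded with the Viterbi algorithm. State "B" stands for background sequence and state "I" for a GC-rich region. All probabilities below are invented for illustration; in practice they would be estimated from training sets.

```python
import math

STATES = ("B", "I")
START = {"B": 0.5, "I": 0.5}
# Hypothetical transition probabilities: each state tends to persist
TRANS = {"B": {"B": 0.9, "I": 0.1}, "I": {"B": 0.1, "I": 0.9}}
# Hypothetical emission probabilities: "I" favours C and G
EMIT = {
    "B": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "I": {"A": 0.10, "C": 0.40, "G": 0.40, "T": 0.10},
}


def viterbi(seq):
    """Return the most probable state path for seq as a string of state names."""
    if not seq:
        return ""
    # best[s] = log-probability of the best path that ends in state s
    best = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for ch in seq[1:]:
        new_best, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[p] + math.log(TRANS[p][s]))
            new_best[s] = (
                best[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][ch])
            )
            pointers[s] = prev
        back.append(pointers)
        best = new_best
    # Follow the back-pointers from the best final state
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))
```

For example, `viterbi("AT" * 6 + "GC" * 6)` labels the first twelve positions "B" and the last twelve "I" under these toy parameters.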

FUTILITY THEOREM

The authors' assertion that essentially all predicted transcription-factor (TF) binding sites that are generated with models for the binding of individual TFs will have no functional role.

SELEX

(Systematic evolution of ligands by exponential amplification). A set of laboratory procedures for the identification of representative sets of ligands for a protein. In the case of DNA-binding proteins, the protein is mixed with a pool of double-stranded oligonucleotides that contain a random core of nucleotides flanked by specific sequences. The protein in complex with bound DNA is recovered and the ligands are subsequently amplified by PCR. The recovered oligonucleotides are sequenced and analysed to reveal the binding specificity of the protein.

INFORMATION CONTENT

A measure of nucleotide conservation at a position, based on information theory.
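One common formulation, assuming a uniform background over the four nucleotides and no small-sample correction, is the maximum entropy (2 bits for DNA) minus the Shannon entropy of the observed nucleotide frequencies in the column. The function name is hypothetical.

```python
import math
from collections import Counter


def information_content(column, alphabet="ACGT"):
    """Information content (bits) of one alignment column of binding sites."""
    counts = Counter(column.upper())
    total = sum(counts[b] for b in alphabet)
    if total == 0:
        raise ValueError("column contains no nucleotides from the alphabet")
    entropy = 0.0
    for b in alphabet:
        p = counts[b] / total
        if p > 0:  # by convention, 0 * log2(0) contributes nothing
            entropy -= p * math.log2(p)
    return math.log2(len(alphabet)) - entropy
```

A perfectly conserved position scores 2 bits, a position with two equally frequent nucleotides scores 1 bit, and a position with all four nucleotides equally frequent scores 0 bits.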

PSEUDOCOUNT

A small correction that is added to observed nucleotide counts when estimating probabilities, to compensate for small sample sizes (that is, few binding sites).
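A minimal sketch of how such a correction enters a binding-site profile: each observed count is augmented by a share of the pseudocount proportional to the background frequency before it is converted to a probability and then to a log-odds weight. The default pseudocount of 1.0, the uniform background and the function names are assumptions made for illustration.

```python
import math

ALPHABET = "ACGT"
UNIFORM = {b: 0.25 for b in ALPHABET}


def column_probabilities(sites, pseudocount=1.0, background=UNIFORM):
    """Per-position nucleotide probabilities from equal-length aligned sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    n = len(sites)
    probs = []
    for i in range(length):
        column = [s[i].upper() for s in sites]
        probs.append({
            # corrected estimate: (count + pseudocount * background) / (N + pseudocount)
            b: (column.count(b) + pseudocount * background[b]) / (n + pseudocount)
            for b in ALPHABET
        })
    return probs


def log_odds(probs, background=UNIFORM):
    """Convert probabilities to log2 odds against the background."""
    return [
        {b: math.log2(col[b] / background[b]) for b in ALPHABET}
        for col in probs
    ]
```

With a pseudocount of zero, a nucleotide that was never observed at a position would receive probability zero and its log-odds weight would be undefined. The correction keeps every weight finite.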

HOMOTYPIC CLUSTER

A cluster of similar transcription-factor (TF) binding sites, often binding the same TF.
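A simple way to look for such clusters, sketched under the assumption that site positions for one factor have already been predicted: slide a window along the sorted positions and report regions in which at least a minimum number of sites fall within the window. The window width, the minimum count and the function name are illustrative; published cluster-detection methods use more elaborate statistical models.

```python
def find_clusters(positions, window=200, min_sites=3):
    """Return (start, end) spans containing >= min_sites sites within window bp."""
    pos = sorted(positions)
    clusters = []
    start = 0
    for end in range(len(pos)):
        # Shrink from the left until all sites fit inside one window
        while pos[end] - pos[start] > window:
            start += 1
        if end - start + 1 >= min_sites:
            span = (pos[start], pos[end])
            if clusters and span[0] <= clusters[-1][1]:
                # Overlaps the previous cluster: extend it
                clusters[-1] = (clusters[-1][0], span[1])
            else:
                clusters.append(span)
    return clusters
```

For example, `find_clusters([10, 50, 120, 900, 1500, 1540, 1600, 1650])` returns `[(10, 120), (1500, 1650)]`; the isolated site at position 900 is not reported.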

BAYESIAN [METHOD]

A statistical method of combining the likelihood with additional information to produce an overall estimate of the strength of a piece of evidence.

About this article

Cite this article

Wasserman, W., Sandelin, A. Applied bioinformatics for the identification of regulatory elements. Nat Rev Genet 5, 276–287 (2004). https://doi.org/10.1038/nrg1315

  • DOI: https://doi.org/10.1038/nrg1315
