  • Review Article

Genomic strategies to identify mammalian regulatory sequences

Key Points

  • Computational tools for the analysis of genomic DNA are good at identifying coding sequences, but poor at identifying regulatory sequences.

  • Comparative genomic sequence analysis is a powerful approach for identifying conserved non-coding regions. Many such conserved regions have been shown to be involved in gene regulation.

  • A second approach to identify regulatory regions is to look for sequence motifs known to bind to transcription factors. A number of databases have compiled information on these motifs.

  • A third approach is to use expression profiling to identify regulatory sequences. Co-regulated genes are identified by cluster analysis and their upstream regions are searched for common motifs. This method has been applied most successfully in yeast.

  • These approaches can be combined to yield a powerful strategy for identifying novel regulatory elements, and for decoding the non-coding portion of mammalian genomes.
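
As an illustration of the comparative approach in the key points above, the following minimal sketch scans a pairwise alignment for windows of high sequence identity. It assumes the two sequences have already been aligned (gaps written as "-"). The function name, the 100-column window and the 75% identity cutoff are illustrative assumptions, not values taken from this article.

```python
def conserved_windows(aligned_a, aligned_b, window=100, min_identity=0.75):
    """Return (start, end, identity) for alignment windows at or above min_identity.

    aligned_a and aligned_b are two rows of an existing pairwise alignment.
    The window size and cutoff are illustrative defaults (assumptions).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    a, b = aligned_a.upper(), aligned_b.upper()
    # 1 where both rows hold the same base; gap columns never count as matches.
    matches = [int(x == y and x != "-") for x, y in zip(a, b)]
    hits = []
    if len(matches) < window:
        return hits
    count = sum(matches[:window])
    for start in range(len(matches) - window + 1):
        if start > 0:
            # Slide the window by one column: add the new column, drop the old one.
            count += matches[start + window - 1] - matches[start - 1]
        identity = count / window
        if identity >= min_identity:
            hits.append((start, start + window, identity))
    return hits


if __name__ == "__main__":
    # Toy alignment: the first five columns match, the last five do not.
    print(conserved_windows("AAAAAAAAAA", "AAAAACCCCC", window=5, min_identity=0.8))
```

Windows that pass the cutoff and fall outside annotated exons would be candidates for follow-up. Separating coding from non-coding hits requires gene annotation, which this sketch does not attempt.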

Abstract

With the continuing accomplishments of the human genome project, high-throughput strategies to identify DNA sequences that are important in mammalian gene regulation are becoming increasingly feasible. In contrast to the historic, labour-intensive, wet-laboratory methods for identifying regulatory sequences, many modern approaches are heavily focused on the computational analysis of large genomic data sets. Data from inter-species genomic sequence comparisons and genome-wide expression profiling, integrated with various computational tools, are poised to contribute to the decoding of genomic sequence and to the identification of those sequences that orchestrate gene regulation. In this review, we highlight several genomic approaches that are being used to identify regulatory sequences in mammalian genomes.

Figure 1: Classification of conserved human–mouse sequences.
Figure 2: Identifying transcription-factor-binding sites.
Figure 3: TRANSFAC and position-weighted matrices.
Figure 4: Combining expression data and sequence conservation.
Figure 5: Annotating the genome.

Acknowledgements

This research was supported by a Programs for Genomic Applications grant awarded to E.M.R. from the NHLBI and conducted at the E.O. Lawrence Berkeley National Laboratory, University of California, sponsored by the Department of Energy, as well as an appointment to the Alexander Hollaender Distinguished Postdoctoral Fellowship Program sponsored by the US Department of Energy, Office of Biological and Environmental Research, and administered by the Oak Ridge Institute for Science and Education (L.A.P.). We thank M. Biggin, J. Bristow, I. Dubchak, C. Prange and D. Symula for their thoughtful comments.

Related links

DATABASE LINKS

LPA

GMCSF

IL4

IL13

IL5

SCL

FURTHER INFORMATION

TRANSFAC

Transcription Regulatory Region Database

COMPEL

Eukaryotic Promoter Database

VISTA

PipMaker

hepatocyte nuclear factor 4 (HNF4) position-weighted matrix

Glossary

DNASEI HYPERSENSITIVITY ASSAY

Identifies regions of the genome that lack nucleosome structure and are therefore readily degraded by the enzyme DNaseI. Such regions tend to be associated with transcriptional activity.

DNA FOOTPRINTING ASSAY

An assay that identifies a region of DNA that is protected from digestion by DNaseI (usually due to the binding of a protein, such as a transcription factor).

GEL SHIFT ASSAY

A gel-based assay in which proteins that bind to a DNA fragment are detected by virtue of the reduced migration of the DNA. The assay is often used to detect transcription factor binding.

CRE RECOMBINASE SYSTEM

A method in which the Cre enzyme catalyses recombination between loxP sequences. If the loxP sequences are arranged as a direct repeat, recombination will delete the DNA between the sites.

CPG ISLANDS

Sequences of at least 200 bp with greater than 50% G+C content and high CpG frequency.
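
The definition above can be turned into a simple check. The glossary does not quantify "high CpG frequency"; the sketch below assumes the commonly used observed/expected CpG ratio with a cutoff of 0.6, and both that measure and the function names are assumptions made for illustration.

```python
def cpg_stats(seq):
    """Return (G+C fraction, observed/expected CpG ratio) for a DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact here
    gc_fraction = (c + g) / n if n else 0.0
    # Expected CpG count under independence is (C count * G count) / length.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


def looks_like_cpg_island(seq, min_length=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply the glossary criteria; min_obs_exp is an assumed cutoff."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return len(seq) >= min_length and gc_fraction > min_gc and obs_exp >= min_obs_exp


if __name__ == "__main__":
    print(looks_like_cpg_island("CG" * 100))  # 200 bp of alternating C and G
    print(looks_like_cpg_island("AT" * 100))  # 200 bp with no G or C
```

In practice such a test would be applied to sliding windows along a genomic sequence rather than to a single fragment.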

FLAT FILE

A computer-readable file or database in which records are not connected or 'related'. Similar to a card index.

RELATIONAL DATABASE

A storage format in which data items can be stored in separate files but linked together to form different relations. This system allows greater flexibility than a flat file format.

MLUI CELL-CYCLE BOX

An 8-bp motif (ACGCGTNA) that promotes the transcription of genes involved in DNA replication in yeast.
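
A motif written with the ambiguity code N, such as the one above, can be located in a sequence by translating it into a regular expression. The following is a minimal sketch that searches the forward strand only; the function names and the example sequence are hypothetical.

```python
import re

# Minimal IUPAC table: only the symbols needed for ACGCGTNA (an assumption;
# a full table would include the other ambiguity codes).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "N": "[ACGT]"}


def motif_to_regex(motif):
    """Compile a motif into a regex; the lookahead allows overlapping matches."""
    body = "".join(IUPAC[base] for base in motif.upper())
    return re.compile("(?=(" + body + "))")


def find_motif(seq, motif="ACGCGTNA"):
    """Return (position, matched site) pairs on the forward strand of seq."""
    pattern = motif_to_regex(motif)
    return [(m.start(), m.group(1)) for m in pattern.finditer(seq.upper())]


if __name__ == "__main__":
    # Hypothetical upstream fragment containing one site at position 2.
    print(find_motif("TTACGCGTCATT"))
```

Scanning the reverse complement as well, and comparing motif counts in co-regulated genes against a background set, would be natural extensions.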

About this article

Cite this article

Pennacchio, L., Rubin, E. Genomic strategies to identify mammalian regulatory sequences. Nat Rev Genet 2, 100–109 (2001). https://doi.org/10.1038/35052548

  • DOI: https://doi.org/10.1038/35052548
