Key Points
- The identification of evolutionarily constrained sequences is an unbiased approach for finding functional sequences in genomes. However, their identification is strongly affected by upstream analyses.
- Genomes are sequenced to different levels of finishing, which affects downstream comparative analyses.
- Before genomic sequences can be aligned, segments of homologous collinearity must be identified. Errors at this stage can have a dramatic effect on the identification of constrained sequences.
- Base-pair sequence-alignment programs differentially handle the complexities of evolution, such as insertions/deletions and duplications. In addition, new approaches to identify regions of alignment uncertainty can be used.
- Current approaches that utilize evolutionary sequence constraint focus on regions that are deeply constrained. Newer approaches combined with sequences from more species can now be pursued to identify weakly and/or lineage-specific constrained sequences.
- The amount of detectable constrained sequence depends on the phylogenetic scope being pursued as well as the resolution and intensity of the desired detectable constraint.
- Large collaborative projects such as ENCODE are shedding light on the correlation between sequence constraint and sequence function. Additional methods are also available for determining the biological significance of constrained sequences.
Abstract
The comparison of genomic sequences is now a common approach to identifying and characterizing functional regions in vertebrate genomes. However, for theoretical reasons and because of practical issues, the generation of these data sets is non-trivial and can have many pitfalls. We are currently seeing an explosion of comparative sequence data, the benefits and limitations of which need to be disseminated to the scientific community. This Review provides a critical overview of the different types of sequence data that are available for analysis and of contemporary comparative sequence analysis methods, highlighting both their strengths and limitations. Approaches to determining the biological significance of constrained sequence are also explored.
Glossary
- Purifying selection
The evolutionary process of rejecting substitutions in functional DNA, thereby making such sequences more similar when compared among different species.
- Orthologous
Homologous sequences in different species that arose from a speciation event.
- Homologous
Sequences that have a common ancestor but might be related by either speciation or duplication events in a genome. Pragmatically, homology is detected by the presence of similarity between two sequences.
- Contig
A contiguous piece of DNA that is assembled from shorter overlapping sequence reads.
- Whole-genome shotgun
The process of shearing the DNA from an entire genome into smaller pieces that are randomly (or 'shotgun') sequenced en masse.
- Genome coverage
The total number of bases that are sequenced divided by the genome size. Because reads are sequenced at random, the fraction of the genome that is actually covered differs from this value and follows the statistical properties of a Poisson distribution.
- Segmental duplications
Regions (or segments) of a genome that evolved from a single common ancestor and arose from a duplication event. As such, these sequences within the same genome are paralogous to each other.
- Pseudogene
A presumably non-functional region of DNA with homology to an actual gene. Pseudogenes typically arise from the reincorporation of an RNA intermediate into the genomic sequence.
- Paralogous
The homology between two genomic segments that arose from a duplication event.
- Heuristic methods
For large computational problems, the application of workable but not formally correct solutions to help reduce the computational time. In the case of sequence alignment, common heuristic methods include the progressive alignment of closer sequences to each other first before aligning to more distant species, and the use of highly similar anchoring sequences to reduce the search space in the alignment.
- Compute farms
Large groups of computers, each on their own only able to analyse a small piece of data (similar to a typical desktop PC), but which, when combined together, provide a powerful resource for analysing computationally intense problems.
- Dynamic programming
An algorithmic technique that analyses data efficiently, usually by breaking the computational problem down into smaller, simpler sub-problems.
- Suffix tree
An indexing technique to efficiently store all sub-sequences of a string of letters.
- Hidden Markov model
A mathematical model that describes a finite set of 'states' and the probabilities of transitioning from one state to another.
- Ancestral repeat
Relics of transposable elements that inserted before a speciation event and are therefore orthologous and presumed to be non-functional. As such, these regions are largely thought to be neutrally evolving.
- Eutherian radiation
Approximately 80 million years ago, a large diversity of mammalian species began to evolve. These placental mammals provide a rich resource for identifying constrained sequences.
- Four-fold degenerate sites
Third positions of codons for which any base yields the same amino acid.
- False-discovery rate
A statistical measure of error, specifically defined as 1 – (true positives / (true + false positives)). Such an error estimate allows for greater fluctuation in the total amount of detected true positives, as it reflects the proportion of false positives in the resulting data set rather than an absolute value of false positives.
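Two of the glossary entries above (genome coverage and false-discovery rate) are defined by simple arithmetic, and a small worked sketch may make them more concrete. The Python below is illustrative only and is not taken from the Review: the function names are hypothetical, and the coverage function assumes the standard Poisson approximation, in which the probability that a given base is missed by every read at sequencing redundancy c is e^(-c), ignoring read length and edge effects.

```python
import math


def expected_fraction_covered(total_bases_sequenced: float, genome_size: float) -> float:
    """Expected fraction of the genome covered by at least one read.

    Assumption (not from the source text): reads land uniformly at random,
    so the number of reads covering a base is Poisson-distributed with mean
    equal to the nominal coverage c = total_bases_sequenced / genome_size.
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    coverage = total_bases_sequenced / genome_size  # nominal coverage, e.g. 2 for '2x'
    return 1.0 - math.exp(-coverage)  # 1 - P(zero reads cover a base)


def false_discovery_rate(true_positives: int, false_positives: int) -> float:
    """False-discovery rate as defined in the glossary:
    1 - (true positives / (true positives + false positives)).
    """
    called = true_positives + false_positives
    if called == 0:
        raise ValueError("no predictions were made, so the rate is undefined")
    return 1.0 - true_positives / called
```

Under these assumptions, a hypothetical 3-Gb genome sequenced to 2x nominal coverage (6 Gb of reads) would be expected to have roughly 86% of its bases covered by at least one read, and a set of 100 predicted constrained elements containing 10 false positives would have a false-discovery rate of about 0.1.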
Cite this article
Margulies, E., Birney, E. Approaches to comparative sequence analysis: towards a functional view of vertebrate genomes. Nat Rev Genet 9, 303–313 (2008). https://doi.org/10.1038/nrg2185