Comparative assessment of methods for aligning multiple genome sequences

Chen, Xiaoyu; Tompa, Martin

doi:10.1038/nbt.1637

Analysis
Published: 23 May 2010

Xiaoyu Chen¹ &
Martin Tompa¹

Nature Biotechnology volume 28, pages 567–572 (2010)

Abstract

Multiple sequence alignment is a difficult computational problem. There have been compelling pleas for methods to assess whole-genome multiple sequence alignments and compare the alignments produced by different tools. We assess the four ENCODE alignments, each of which aligns 28 vertebrates on 554 Mbp of total input sequence. We measure the level of agreement among the alignments and compare their coverage and accuracy. We find a disturbing lack of agreement among the alignments not only in species distant from human, but even in mouse, a well-studied model organism. Overall, the assessment shows that Pecan produces the most accurate or nearly most accurate alignment in all species and genomic location categories, while still providing coverage comparable to or better than that of the other alignments in the placental mammals. Our assessment reveals that constructing accurate whole-genome multiple sequence alignments remains a significant challenge, particularly for noncoding regions and distantly related species.
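
As a rough illustration of what measuring coverage and agreement between two alignments can involve, the sketch below compares two alignments of the same human region to one other species. The data representation, the function names, and the precise definitions of the three percentages are assumptions made for this example; the category names echo the Figure 2 caption, but the definitions actually used in the paper are not given in this preview.

```python
from typing import Dict

# Hypothetical representation (an assumption, not the paper's data format):
# for one non-human species, map each aligned human position to the position
# in that species it is aligned with. Unaligned human positions are absent.
PairwiseMap = Dict[int, int]


def coverage_percent(alignment: PairwiseMap, region_length: int) -> float:
    """Percentage of human positions in the region aligned to the species."""
    if region_length <= 0:
        raise ValueError("region_length must be positive")
    return 100.0 * len(alignment) / region_length


def compare_alignments(a: PairwiseMap, b: PairwiseMap) -> Dict[str, float]:
    """Classify the human positions aligned by `a` relative to alignment `b`.

    Assumed definitions, as percentages of positions aligned by `a`:
      agree    - `b` aligns the position to the same species position
      disagree - `b` aligns the position to a different species position
      unique   - `b` leaves the position unaligned
    """
    total = len(a)
    if total == 0:
        return {"agree": 0.0, "disagree": 0.0, "unique": 0.0}
    shared = a.keys() & b.keys()
    agree = sum(1 for pos in shared if a[pos] == b[pos])
    disagree = len(shared) - agree
    unique = total - len(shared)
    return {
        "agree": 100.0 * agree / total,
        "disagree": 100.0 * disagree / total,
        "unique": 100.0 * unique / total,
    }


# Toy data, invented for illustration only.
alignment_a = {0: 10, 1: 11, 2: 12, 3: 40}
alignment_b = {0: 10, 1: 11, 2: 99, 4: 50}
result = compare_alignments(alignment_a, alignment_b)
```

Under these assumed definitions the three percentages always sum to 100 for any non-empty alignment, which makes it easy to see how much of one alignment is contradicted by another rather than merely absent from it.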

**Figure 1: Comparison of coverage of the alignments.**

**Figure 2: Comparison percentages agree%, unique% and disagree% for TBA, MAVID and MLAGAN.**

**Figure 3: Comparison of accuracy of the alignments, as measured by suspicious%.**

**Figure 4: Pairwise alignment scores of suspicious regions versus those for alternative alignments of the same human region.**

**Figure 5: Summary plot of suspicious% vs. coverage, aggregated over all four location categories.**

Acknowledgements

We thank P. Green, W. Noble, W.L. Ruzzo and especially A. Prakash for helpful discussions and technical advice. We thank the US National Institutes of Health and the Natural Sciences and Engineering Research Council of Canada for financial support.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Department of Genome Sciences, University of Washington, Seattle, Washington, USA
Xiaoyu Chen & Martin Tompa

Contributions

X.C., design, implementation, experimentation, analysis; M.T., design, analysis.

Corresponding author

Correspondence to Martin Tompa.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

About this article

Cite this article

Chen, X., Tompa, M. Comparative assessment of methods for aligning multiple genome sequences. Nat Biotechnol 28, 567–572 (2010). https://doi.org/10.1038/nbt.1637

Received: 21 December 2009
Accepted: 27 April 2010
Published: 23 May 2010
Issue Date: June 2010
DOI: https://doi.org/10.1038/nbt.1637
