Comparative assessment of methods for aligning multiple genome sequences

Abstract

Multiple sequence alignment is a difficult computational problem. There have been compelling pleas for methods to assess whole-genome multiple sequence alignments and compare the alignments produced by different tools. We assess the four ENCODE alignments, each of which aligns 28 vertebrates on 554 Mbp of total input sequence. We measure the level of agreement among the alignments and compare their coverage and accuracy. We find a disturbing lack of agreement among the alignments not only in species distant from human, but even in mouse, a well-studied model organism. Overall, the assessment shows that Pecan produces the most accurate or nearly most accurate alignment in all species and genomic location categories, while still providing coverage comparable to or better than that of the other alignments in the placental mammals. Our assessment reveals that constructing accurate whole-genome multiple sequence alignments remains a significant challenge, particularly for noncoding regions and distantly related species.
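As a rough illustration of what measuring agreement and coverage can mean in practice, the sketch below compares two alignments of the same human region against a single other species. The definitions are assumptions made for illustration, since this abstract does not give the paper's exact ones: a human position counts as covered if it is aligned to some nucleotide in the other species, and two alignments agree at a human position if both align it to the same position, disagree if they align it to different positions, and the position is unique if only one of them aligns it. The `Alignment` type and both function names are hypothetical.

```python
from typing import Dict, Optional

# Hypothetical representation (an assumption, not the paper's data format):
# for one non-human species, map each human position to the position it is
# aligned to in that species, or None if the human position is unaligned.
Alignment = Dict[int, Optional[int]]


def coverage(aln: Alignment, region_length: int) -> float:
    """Percentage of human positions aligned to some nucleotide."""
    aligned = sum(1 for pos in aln.values() if pos is not None)
    return 100.0 * aligned / region_length


def compare(a: Alignment, b: Alignment) -> Dict[str, float]:
    """Classify human positions aligned by at least one of two alignments."""
    agree = unique = disagree = 0
    for human_pos in set(a) | set(b):
        pa, pb = a.get(human_pos), b.get(human_pos)
        if pa is None and pb is None:
            continue  # aligned by neither; not counted
        if pa is None or pb is None:
            unique += 1  # aligned by exactly one alignment
        elif pa == pb:
            agree += 1  # both align to the same position
        else:
            disagree += 1  # both align, but to different positions
    total = agree + unique + disagree
    if total == 0:
        return {"agree%": 0.0, "unique%": 0.0, "disagree%": 0.0}
    return {
        "agree%": 100.0 * agree / total,
        "unique%": 100.0 * unique / total,
        "disagree%": 100.0 * disagree / total,
    }


if __name__ == "__main__":
    # Toy data: four human positions, two hypothetical alignments to mouse.
    aln_a = {0: 10, 1: 11, 2: 12, 3: None}
    aln_b = {0: 10, 1: 15, 2: None, 3: None}
    print(coverage(aln_a, 4), coverage(aln_b, 4))
    print(compare(aln_a, aln_b))
```

This sketch handles only two alignments and one species at a time. Comparing more alignments together, or aggregating by genomic location category, would need a definition of agreement that this excerpt does not supply.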

Figure 1: Comparison of coverage of the alignments.
Figure 2: Comparison percentages agree%, unique% and disagree% for TBA, MAVID and MLAGAN.
Figure 3: Comparison of accuracy of the alignments, as measured by suspicious%.
Figure 4: Pairwise alignment scores of suspicious regions versus those for alternative alignments of the same human region.
Figure 5: Summary plot of suspicious% vs. coverage, aggregated over all four location categories.

Acknowledgements

We thank P. Green, W. Noble, W.L. Ruzzo and especially A. Prakash for helpful discussions and technical advice. We thank the US National Institutes of Health and the Natural Sciences and Engineering Research Council of Canada for financial support.

Author information

Contributions

X.C., design, implementation, experimentation, analysis; M.T., design, analysis.

Corresponding author

Correspondence to Martin Tompa.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

About this article

Cite this article

Chen, X., Tompa, M. Comparative assessment of methods for aligning multiple genome sequences. Nat Biotechnol 28, 567–572 (2010). https://doi.org/10.1038/nbt.1637

  • DOI: https://doi.org/10.1038/nbt.1637
