Molecular phylogenetics: principles and practice

Key Points

  • The rapid accumulation of genome sequence data has made phylogenetics an indispensable tool for various branches of biology. However, it has also posed considerable statistical and computational challenges for data analysis.

  • Distance, parsimony, likelihood and Bayesian methods of phylogenetic analysis have different strengths and weaknesses. Although distance methods are good for large data sets of highly similar sequences, likelihood and Bayesian methods often have more power and are more robust, especially for inferring deep phylogenies.

  • Assessing phylogenetic uncertainty remains a difficult statistical problem.

  • Data partitioning may have an important influence on the phylogenetic analysis of genome-scale data sets.

  • Systematic biases, such as long-branch attraction, may be more important than random sampling errors in the analysis of genome-scale data sets.

Abstract

Phylogenies are important for addressing various biological questions such as relationships among species or genes, the origin and spread of viral infection and the demographic changes and migration patterns of species. The advancement of sequencing technologies has taken phylogenetic analysis to a new height. Phylogenies have permeated nearly every branch of biology, and the plethora of phylogenetic methods and software packages that are now available may seem daunting to an experimental biologist. Here, we review the major methods of phylogenetic analysis, including parsimony, distance, likelihood and Bayesian methods. We discuss their strengths and weaknesses and provide guidance for their use.

Figure 1: Markov models of nucleotide substitution.
Figure 2: The neighbour joining algorithm.
Figure 3: Long-branch attraction in theory and in practice.

Acknowledgements

We thank the three referees for their constructive comments and M. Hasegawa and B. Zhong for providing the seed-plant phylogenies of Fig. 3. Z.Y. is supported by a UK Biotechnology and Biological Sciences Research Council grant and a Royal Society Wolfson Research Merit Award. B.R. is supported by a US National Institutes of Health grant.

Author information

Corresponding author

Correspondence to Ziheng Yang.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Related links

FURTHER INFORMATION

Ziheng Yang's homepage

Bruce Rannala's homepage

A comprehensive list of phylogenetic programs maintained by Joe Felsenstein

Nature Reviews Genetics article series on Study designs

Glossary

Systematics

The inference of phylogenetic relationships among species and the use of such information to classify species.

Taxonomy

The description, classification and naming of species.

Coalescent

The process of joining ancestral lineages when the genealogical relationships of a random sample of sequences from a modern population are traced back.

Gene trees

The phylogenetic or genealogical tree of sequences at a gene locus or genomic region.

Statistical phylogeography

The statistical analysis of population data from closely related species to infer population parameters and processes such as population sizes, demography, migration patterns and rates.

Species tree

A phylogenetic tree for a set of species that underlies the gene trees at individual loci.

Systematic errors

Errors that are due to an incorrect model assumption. They are exacerbated when the data size increases.

Random sampling errors

Errors or uncertainties in parameter estimates owing to limited data.

Cluster algorithm

An algorithm of assigning a set of individuals to groups (or clusters) so that objects of the same cluster are more similar to each other than those from different clusters. Hierarchical cluster analysis can be agglomerative (starting with single elements and successively joining them into clusters) or divisive (starting with all objects and successively dividing them into partitions).
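
The agglomerative variant described above can be sketched in a few lines. This is a minimal illustration that assumes average linkage (the scheme used by UPGMA) as the rule for measuring the distance between clusters; the function name, the input layout and the distances in the usage note are hypothetical and are not taken from the article.

```python
def agglomerative_clusters(names, dist):
    """Average-linkage agglomerative clustering (illustrative sketch).

    names: list of item labels.
    dist:  dict mapping frozenset({a, b}) to the distance between items a and b.
    Returns the merges in the order they were made, as
    (cluster_1, cluster_2, average_distance) tuples.
    """
    # Start with every item in its own cluster.
    clusters = [(name,) for name in names]

    def average_distance(c1, c2):
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        # Replace the two joined clusters with the merged cluster.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges
```

For example, with made-up distances of 2 between A and B, 4 between C and D, and 10 for every other pair, the function joins A with B first, then C with D, and finally the two resulting clusters. Other linkage rules (single, complete) would only change `average_distance`.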

Markov chain

A stochastic sequence (or chain) of states with the property that, given the current state, the probabilities for the next state do not depend on the past states.
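
As a concrete case of a Markov chain on nucleotide states, the sketch below assumes the Jukes and Cantor (JC69) model, in which every nucleotide changes to each of the other three at the same rate. The model choice and the function names are assumptions made for illustration; the glossary entry itself does not specify a substitution model.

```python
import math

BASES = "TCAG"


def jc69_transition_probability(i, j, d):
    """Probability that base i is replaced by base j after a branch of
    length d (expected substitutions per site) under the JC69 model."""
    decay = math.exp(-4.0 * d / 3.0)
    if i == j:
        return 0.25 + 0.75 * decay
    return 0.25 - 0.25 * decay


def jc69_distance(p):
    """JC69 distance implied by a proportion p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the JC69 correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

Because the next state depends only on the current state, probabilities over two consecutive branches multiply and sum to the probability over the combined branch, which is the property the definition above describes.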

Transitions

Substitutions between the two pyrimidines (T↔C) or between the two purines (A↔G).

Transversions

Substitutions between a pyrimidine and a purine (T or C↔A or G).
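
The two definitions above can be turned into a small classifier. This is a minimal sketch with hypothetical function names; it assumes two aligned, ungapped DNA sequences containing only the characters A, C, G and T.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(a, b):
    """Return 'identical', 'transition' or 'transversion' for two bases."""
    a, b = a.upper(), b.upper()
    for base in (a, b):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"not a DNA base: {base!r}")
    if a == b:
        return "identical"
    # A transition stays within the purines or within the pyrimidines.
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def count_differences(seq1, seq2):
    """Count transitional and transversional differences between two
    aligned sequences of equal length."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    counts = {"transition": 0, "transversion": 0}
    for a, b in zip(seq1, seq2):
        kind = classify_substitution(a, b)
        if kind != "identical":
            counts[kind] += 1
    return counts
```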

Unrooted trees

Phylogenetic trees for which the location of the root is unspecified.

Long-branch attraction

The phenomenon of inferring an incorrect tree with long branches grouped together by parsimony or by model-based methods under simplistic models.

Likelihood ratio test

A general hypothesis-testing method that uses the likelihood to compare two nested hypotheses, often using the χ2 distribution to assess significance.
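
A minimal sketch of the test follows. It assumes that the maximized log-likelihoods of the two nested models are already available, and it covers only one or two degrees of freedom, for which the χ2 tail probability has a closed form in the standard library; the function name and the restriction are choices made for this illustration.

```python
import math


def likelihood_ratio_test(lnL_null, lnL_alt, df):
    """Compare two nested models using 2 * (lnL_alt - lnL_null).

    lnL_null: maximized log-likelihood of the simpler (null) model.
    lnL_alt:  maximized log-likelihood of the more general model.
    df:       difference in the number of free parameters (1 or 2 here).
    Returns (test statistic, p-value from the chi-squared approximation).
    """
    stat = 2.0 * (lnL_alt - lnL_null)
    if stat < 0:
        raise ValueError("the more general model cannot fit worse than the null")
    if df == 1:
        # For one degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
        p_value = math.erfc(math.sqrt(stat / 2.0))
    elif df == 2:
        # For two degrees of freedom, P(X > x) = exp(-x / 2).
        p_value = math.exp(-stat / 2.0)
    else:
        raise NotImplementedError("this sketch handles only df = 1 or df = 2")
    return stat, p_value
```

With hypothetical log-likelihoods of -1000.0 for the null model and -998.0 for the alternative, the statistic is 4.0, which gives a p-value of about 0.046 at one degree of freedom.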

Molecular clock

The hypothesis or observation that the evolutionary rate is constant over time or across lineages.

Prior distribution

The distribution assigned to parameters before the analysis of the data.

Posterior distribution

The distribution of the parameters (or models) conditional on the data. It combines the information in the prior and in the data (likelihood).
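
For a discrete set of candidate models, such as a handful of tree topologies, the combination of prior and likelihood can be sketched as below. The example assumes that a log-likelihood for each candidate is already available (in a real analysis this would require integrating over branch lengths and other parameters), and the function name is hypothetical.

```python
import math


def posterior_probabilities(log_likelihoods, priors):
    """Posterior probability of each candidate model.

    log_likelihoods: log of the (marginal) likelihood of the data under each model.
    priors:          prior probability of each model (positive, summing to 1).
    """
    if len(log_likelihoods) != len(priors):
        raise ValueError("need one prior per log-likelihood")
    # Posterior is proportional to prior * likelihood; work on the log scale.
    log_joint = [ll + math.log(p) for ll, p in zip(log_likelihoods, priors)]
    # Subtract the maximum before exponentiating to avoid numerical underflow.
    largest = max(log_joint)
    weights = [math.exp(x - largest) for x in log_joint]
    total = sum(weights)
    return [w / total for w in weights]
```

If two models have equal priors and the data are twice as likely under the first, the posterior probabilities are 2/3 and 1/3.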

Markov chain Monte Carlo algorithms

(MCMC algorithms). A Monte Carlo simulation is a computer simulation of a biological process using random numbers. An MCMC algorithm is a Monte Carlo simulation algorithm that generates a sample from a target distribution (often a Bayesian posterior distribution).
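
The idea can be illustrated with a Metropolis sampler, one of the simplest MCMC algorithms. This is a sketch under stated assumptions: the target is a one-dimensional standard normal distribution rather than a posterior over trees, proposals are uniform around the current value, and all names and settings are hypothetical.

```python
import math
import random


def log_standard_normal(x):
    """Log density of the standard normal, up to an additive constant."""
    return -0.5 * x * x


def metropolis(log_target, start, n_samples, step=1.0, seed=0):
    """Draw n_samples from the distribution whose log density (up to a
    constant) is log_target, using a symmetric uniform proposal."""
    rng = random.Random(seed)
    x = start
    log_p = log_target(x)
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.uniform(-step, step)
        log_p_new = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(x)).
        if rng.random() < math.exp(min(0.0, log_p_new - log_p)):
            x, log_p = proposal, log_p_new
        # The current state is recorded whether or not the move was accepted.
        samples.append(x)
    return samples
```

Calling `metropolis(log_standard_normal, start=0.0, n_samples=20000, step=2.0)` should give samples whose mean is close to 0 and whose variance is close to 1. Phylogenetic MCMC programs use the same accept or reject rule but propose changes to trees, branch lengths and substitution parameters.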

Clades

Groups of species that have descended from a common ancestor.

Graphics processing units

(GPUs). Specialized processors that are traditionally used to render output for a video display and have recently been explored for use in parallel computation.

About this article

Cite this article

Yang, Z., Rannala, B. Molecular phylogenetics: principles and practice. Nat Rev Genet 13, 303–314 (2012). https://doi.org/10.1038/nrg3186
