

  • Review Article

Phylogenetic tree building in the genomic age

Abstract

Knowing phylogenetic relationships among species is fundamental for many studies in biology. An accurate phylogenetic tree underpins our understanding of the major transitions in evolution, such as the emergence of new body plans or metabolism, and is key to inferring the origin of new genes, detecting molecular adaptation, understanding morphological character evolution and reconstructing demographic changes in recently diverged species. Although data are ever more plentiful and powerful analysis methods are available, there remain many challenges to reliable tree building. Here, we discuss the major steps of phylogenetic analysis, including identification of orthologous genes or proteins, multiple sequence alignment, and choice of substitution models and inference methodologies. Understanding the different sources of errors and the strategies to mitigate them is essential for assembling an accurate tree of life.


Fig. 1: Phylogenomic pipeline.
Fig. 2: Distinguishing orthologous and paralogous relationships between genes.
Fig. 3: Heterogeneous rates across lineages and long-branch attraction.
Fig. 4: Heterogeneous substitution rates and patterns across sites.
Fig. 5: Heterogeneities across time or lineages.
Fig. 6: Homogeneous partition and mixture models.
Fig. 7: Gene-tree–species-tree incongruence.


References

  1. Delsuc, F., Brinkmann, H. & Philippe, H. Phylogenomics and the reconstruction of the tree of life. Nat. Rev. Genet. 6, 361–375 (2005).

    Article  CAS  PubMed  Google Scholar 

  2. Telford, M. J. & Budd, G. E. The place of phylogeny and cladistics in Evo-Devo research. Int. J. Dev. Biol. 47, 479–490 (2003).

    PubMed  Google Scholar 

  3. Fitch, W. M. & Margoliash, E. Construction of phylogenetic trees. Science 155, 279–284 (1967).

    Article  CAS  PubMed  Google Scholar 

  4. Darwin, C. R. Darwin Correspondence Project, ‘Letter no. 2143’. https://www.darwinproject.ac.uk/letter/DCP-LETT-2143.xml.

  5. Field, K. G. et al. Molecular phylogeny of the animal kingdom. Science 239, 748–753 (1988).

    Article  CAS  PubMed  Google Scholar 

  6. Aguinaldo, A. M. A. et al. Evidence for a clade of nematodes, arthropods and other moulting animals. Nature 387, 489–493 (1997). Classic paper on LBA that shows the benefit of excluding long-branch taxa.

    Article  CAS  PubMed  Google Scholar 

  7. Telford, M. J., Budd, G. E. & Philippe, H. Phylogenomic insights into animal evolution. Curr. Biol. 25, R876–R887 (2015).

    Article  CAS  PubMed  Google Scholar 

  8. Lewin, H. A. et al. Earth BioGenome project: sequencing life for the future of life. Proc. Natl Acad. Sci. USA 115, 4325–4333 (2018).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  9. Woese, C. R. & Fox, G. E. Phylogenetic structure of the prokaryotic domain: the primary kingdoms. Proc. Natl Acad. Sci. USA 74, 5088–5090 (1977).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  10. Kocher, T. D. et al. Dynamics of mitochondrial DNA evolution in animals: amplification and sequencing with conserved primers. Proc. Natl Acad. Sci. USA 86, 6196–6200 (1989).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  11. Philippe, H. & Telford, M. J. Large-scale sequencing and the new animal phylogeny. Trends Ecol. Evol. 21, 614–620 (2006).

    Article  PubMed  Google Scholar 

  12. Hoff, K. J. & Stanke, M. Predicting genes in single genomes with AUGUSTUS. Curr. Protoc. Bioinformatics 65, e57 (2019).

    PubMed  Google Scholar 

  13. Laetsch, D. R. & Blaxter, M. L. BlobTools: interrogation of genome assemblies. F1000Research 6, 1287 (2017).

    Article  Google Scholar 

  14. Simion, P. et al. A software tool ‘CroCo’ detects pervasive cross-species contamination in next generation sequencing data. BMC Biol. 16, 28 (2018). This article identifies cross contamination between multiplexed sequence samples as a frequent occurrence and provides the means to detect this source of error.

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  15. Fitch, W. M. Distinguishing homologous from analogous proteins. Syst. Zool. 19, 99–113 (1970). Original paper defining different forms of homology.

    Article  CAS  PubMed  Google Scholar 

  16. Kristensen, D. M., Wolf, Y. I., Mushegian, A. R. & Koonin, E. V. Computational methods for gene orthology inference. Brief. Bioinformatics 12, 379–391 (2011).

    Article  PubMed  Google Scholar 

  17. Koonin, E. V. Orthologs, paralogs, and evolutionary genomics. Annu. Rev. Genet. 39, 309–338 (2005).

    Article  CAS  PubMed  Google Scholar 

  18. Trachana, K. et al. Orthology prediction methods: a quality assessment using curated protein families. BioEssays 33, 769–780 (2011).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  19. Li, H. TreeFam: a curated database of phylogenetic trees of animal gene families. Nucleic Acids Res. 34 (Database issue), D572–D580 (2006).

    Article  CAS  PubMed  Google Scholar 

  20. Huerta-Cepas, J., Capella-Gutiérrez, S., Pryszcz, L. P., Marcet-Houben, M. & Gabaldón, T. PhylomeDB v4: zooming into the plurality of evolutionary histories of a genome. Nucleic Acids Res. 42 (Database issue), D897–D902 (2014).

    Article  CAS  PubMed  Google Scholar 

  21. Mi, H., Muruganujan, A. & Thomas, P. D. PANTHER in 2013: modeling the evolution of gene function, and other gene attributes, in the context of phylogenetic trees. Nucleic Acids Res. 41 (Database issue), D377–D386 (2013).

    Google Scholar 

  22. Glover, N. et al. Advances and applications in the quest for orthologs. Mol. Biol. Evol. 36, 2157–2164 (2019).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  23. Boeckmann, B. et al. Quest for orthologs entails quest for tree of life: in search of the gene stream. Genome Biol. Evol. 7, 1988–1999 (2015).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  24. Harpak, A., Lan, X., Gao, Z. & Pritchard, J. K. Frequent nonallelic gene conversion on the human lineage and its effect on the divergence of gene duplicates. Proc. Natl Acad. Sci. USA 114, 12779–12784 (2017).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  25. Li, L., Stoeckert, C. J. Jr. & Roos, D. S. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 13, 2178–2189 (2003).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  26. Emms, D. M. & Kelly, S. OrthoFinder: Phylogenetic orthology inference for comparative genomics. Genome Biol. 20, 238 (2019).

    Article  PubMed  PubMed Central  Google Scholar 

  27. Altenhoff, A. M. et al. OMA standalone: Orthology inference among public and custom genomes and transcriptomes. Genome Res. 29, 1152–1163 (2019).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  28. Kaduk, M., Riegler, C., Lemp, O. & Sonnhammer, E. L. L. HieranoiDB: a database of orthologs inferred by Hieranoid. Nucleic Acids Res. 45, D687–D690 (2017).

    Article  CAS  PubMed  Google Scholar 

  29. Kriventseva, E. V. et al. OrthoDB v10: sampling the diversity of animal, plant, fungal, protist, bacterial and viral genomes for evolutionary and functional annotations of orthologs. Nucleic Acids Res. 47, D807–D811 (2019).

    Article  CAS  PubMed  Google Scholar 

  30. Mushegian, A. R. & Koonin, E. V. A minimal gene set for cellular life derived by comparison of complete bacterial genomes. Proc. Natl Acad. Sci. USA 93, 10268–10273 (1996).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  31. Overbeek, R., Fonstein, M., D’Souza, M., Push, G. D. & Maltsev, N. The use of gene clusters to infer functional coupling. Proc. Natl Acad. Sci. USA 96, 2896–2901 (1999).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  32. Wall, D. P., Fraser, H. B. & Hirsh, A. E. Detecting putative orthologs. Bioinformatics 19, 1710–1711 (2003).

    Article  CAS  PubMed  Google Scholar 

  33. Dessimoz, C., Boeckmann, B., Roth, A. C. J. & Gonnet, G. H. Detecting non-orthology in the COGs database and other approaches grouping orthologs using genome-specific best hits. Nucleic Acids Res. 34, 3309–3316 (2006).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  34. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).

    Article  CAS  PubMed  Google Scholar 

  35. Altenhoff, A. M. et al. The OMA orthology database in 2018: retrieving evolutionary relationships among all domains of life through richer web and programmatic interfaces. Nucleic Acids Res. 46, D477–D485 (2018).

    Article  CAS  PubMed  Google Scholar 

  36. Van Bel, M. et al. PLAZA 4.0: an integrative resource for functional, evolutionary and comparative plant genomics. Nucleic Acids Res. 46, D1190–D1196 (2018).

    Article  PubMed  CAS  Google Scholar 

  37. Scornavacca, C. et al. OrthoMaM v10: scaling-up orthologous coding sequence and exon alignments with more than one hundred mammalian genomes. Mol. Biol. Evol. 36, 861–862 (2019).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  38. Petersen, M. et al. Orthograph: a versatile tool for mapping coding nucleotide sequences to clusters of orthologous genes. BMC Bioinformatics 18, 111 (2017).

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  39. Kuzniar, A., van Ham, R. C. H. J., Pongor, S. & Leunissen, J. A. M. The quest for orthologs: finding the corresponding gene across genomes. Trends Genet. 24, 539–551 (2008).

    Article  CAS  PubMed  Google Scholar 

  40. Szöllősi, G. J., Tannier, E., Daubin, V. & Boussau, B. The inference of gene trees with species trees. Syst. Biol. 64, e42–e62 (2015).

    Article  PubMed  CAS  Google Scholar 

  41. Boussau, B. et al. Genome-scale coestimation of species and gene trees. Genome Res. https://doi.org/10.1101/gr.141978.112 (2013).

    Article  PubMed  PubMed Central  Google Scholar 

  42. Wehe, A., Bansal, M. S., Burleigh, J. G. & Eulenstein, O. DupTree: a program for large-scale phylogenetic analyses using gene tree parsimony. Bioinformatics 24, 1540–1541 (2008).

    Article  CAS  PubMed  Google Scholar 

  43. Bansal, M. S., Burleigh, J. G. & Eulenstein, O. Efficient genome-scale phylogenetic analysis under the duplication-loss and deep coalescence cost models. BMC Bioinformatics 11 (Suppl. 1), S42 (2010).

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  44. Chaudhary, R., Burleigh, J. G. & Fernández-Baca, D. Inferring species trees from incongruent multi-copy gene trees using the Robinson-Foulds distance. Algorithms Mol. Biol. 28, 8 (2013).

    Google Scholar 

  45. Chaudhary, R., Boussau, B., Burleigh, J. G. & Fernández-Baca, D. Assessing approaches for inferring species trees from multi-copy genes. Syst. Biol. 64, 325–339 (2015).

    Article  CAS  PubMed  Google Scholar 

  46. Scornavacca, C. & Galtier, N. Incomplete lineage sorting in mammalian phylogenomics. Syst. Biol. 66, 112–120 (2017).

    CAS  PubMed  Google Scholar 

  47. Sonnhammer, E. L. L. et al. Big data and other challenges in the quest for orthologs. Bioinformatics 30, 2993–2998 (2014).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  48. Higgins, D. G. & Sharp, P. M. CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene 73, 237–244 (1988).

    Article  CAS  PubMed  Google Scholar 

  49. Abascal, F., Zardoya, R. & Telford, M. J. TranslatorX: Multiple alignment of nucleotide sequences guided by amino acid translations. Nucleic Acids Res. 38, W7–W13 (2010).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  50. Dessimoz, C. & Gil, M. Phylogenetic assessment of alignments reveals neglected tree signal in gaps. Genome Biol. 11, R37 (2010).

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  51. Hall, B. G. Comparison of the accuracies of several phylogenetic methods using protein and DNA sequences. Mol. Biol. Evol. 22, 792–802 (2005).

    Article  CAS  PubMed  Google Scholar 

  52. Edgar, R. C. MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32, 1792–1797 (2004).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  53. Sievers, F. & Higgins, D. G. Clustal Omega. Curr. Protoc. Bioinformatics 48, 3–13 (2014).

    Article  PubMed  Google Scholar 

  54. Katoh, K., Kuma, K. I., Toh, H. & Miyata, T. MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res. 33, 511–518 (2005).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  55. Notredame, C., Higgins, D. G. & Heringa, J. T-coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302, 205–217 (2000).

    Article  CAS  PubMed  Google Scholar 

  56. Do, C. B., Mahabhashyam, M. S. P., Brudno, M. & Batzoglou, S. ProbCons: probabilistic consistency-based multiple sequence alignment. Genome Res. 15, 330–340 (2005).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  57. Chatzou, M. et al. Multiple sequence alignment modeling: methods and applications. Brief. Bioinformatics 17, 1009–1023 (2016).

    Article  CAS  PubMed  Google Scholar 

  58. Suchard, M. A. & Redelings, B. D. BAli-Phy: simultaneous Bayesian inference of alignment and phylogeny. Bioinformatics 22, 2047–2048 (2006).

    Article  CAS  PubMed  Google Scholar 

  59. Novák, Á., Miklós, I., Lyngsø, R. & Hein, J. StatAlign: an extendable software package for joint Bayesian estimation of alignments and evolutionary trees. Bioinformatics 24, 2403–2404 (2008).

    Article  PubMed  CAS  Google Scholar 

  60. Thorne, J. L., Kishino, H. & Felsenstein, J. An evolutionary model for maximum likelihood alignment of DNA sequences. J. Mol. Evol. 33, 114–124 (1991).

    Article  CAS  PubMed  Google Scholar 

  61. Lunter, G., Miklós, I., Drummond, A., Jensen, J. L. & Hein, J. Bayesian coestimation of phylogeny and sequence alignment. BMC Bioinformatics 6, 83 (2005).

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  62. Löytynoja, A. & Goldman, N. Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis. Science 320, 1632–1635 (2008).

    Article  PubMed  CAS  Google Scholar 

  63. Vialle, R. A., Tamuri, A. U. & Goldman, N. Alignment modulates ancestral sequence reconstruction accuracy. Mol. Biol. Evol. 35, 1783–1797 (2018).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  64. Simion, P. et al. A large and consistent phylogenomic dataset supports sponges as the sister group to all other animals. Curr. Biol. 27, 958–967 (2017).

    Article  CAS  PubMed  Google Scholar 

  65. Philippe, H. et al. Mitigating anticipated effects of systematic errors supports sister-group relationship between Xenacoelomorpha and Ambulacraria. Curr. Biol. 29, 1818–1826 (2019).

    Article  CAS  PubMed  Google Scholar 

  66. Struck, T. H. Trespex-detection of misleading signal in phylogenetic reconstructions based on tree information. Evol. Bioinformatics 10, 51–67 (2014).

    Article  CAS  Google Scholar 

  67. De Vienne, D. M., Ollier, S. & Aguileta, G. Phylo-MCOA: a fast and efficient method to detect outlier genes and species in phylogenomics using multiple co-inertia analysis. Mol. Biol. Evol. 29, 1587–1598 (2012).

    Article  PubMed  CAS  Google Scholar 

  68. Mai, U. & Mirarab, S. TreeShrink: Fast and accurate detection of outlier long branches in collections of phylogenetic trees. BMC Genomics 19, 272 (2018).

    Article  PubMed  PubMed Central  Google Scholar 

  69. Ogden, T. H. & Rosenberg, M. S. Multiple sequence alignment accuracy and phylogenetic inference. Syst. Biol. 55, 314–328 (2006).

    Article  PubMed  Google Scholar 

  70. Fletcher, W. & Yang, Z. The effect of insertions, deletions, and alignment errors on the branch-site test of positive selection. Mol. Biol. Evol. 27, 2257–2267 (2010).

    Article  CAS  PubMed  Google Scholar 

  71. Castresana, J. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol. Biol. Evol. 17, 540–552 (2000).

    Article  CAS  PubMed  Google Scholar 

  72. Capella-Gutiérrez, S., Silla-Martínez, J. M. & Gabaldón, T. TrimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25, 1972–1973 (2009).

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  73. Misof, B. & Misof, K. A Monte Carlo approach successfully identifies randomness in multiple sequence alignments: a more objective means of data exclusion. Syst. Biol. 58, 21–34 (2009).

    Article  CAS  PubMed  Google Scholar 

  74. Moretti, S. et al. The M-Coffee web server: A meta-method for computing multiple sequence alignments by combining alternative alignment methods. Nucleic Acids Res. 35, W645–W648 (2007).

    Article  PubMed  PubMed Central  Google Scholar 

  75. Tan, G. et al. Current methods for automated filtering of multiple sequence alignments frequently worsen single-gene phylogenetic inference. Syst. Biol. 64, 778–791 (2015).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  76. Talavera, G. & Castresana, J. Improvement of phylogenies after removing divergent and ambiguously aligned blocks from protein sequence alignments. Syst. Biol. 56, 564–577 (2007).

    Article  CAS  PubMed  Google Scholar 

  77. Saitou, N. & Nei, M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4, 406–425 (1987).

    CAS  PubMed  Google Scholar 

  78. Gascuel, O. BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. Mol. Biol. Evol. 14, 685–695 (1997).

    Article  CAS  PubMed  Google Scholar 

  79. Saitou, N. Introduction to Evolutionary Genomics (Springer, 2018) https://doi.org/10.1007/978-3-319-92642-1.

  80. Wheeler, T. J. in Lecture Notes in Computer Science. (eds Salzberg, S.L. & Warnow, T.) 375–389 (Springer, 2009). https://doi.org/10.1007/978-3-642-04241-6_31.

  81. Felsenstein, J. Inferring Phylogenies (Sinauer Associates, 2004).

  82. Yang, Z. & Rannala, B. Molecular phylogenetics: principles and practice. Nat. Rev. Genet. 13, 303–314 (2012).

    Article  CAS  PubMed  Google Scholar 

  83. Yang, Z. Molecular Evolution: A Statistical Approach (Oxford University Press, 2014).

  84. Fitch, W. M. Toward defining the course of evolution: minimum change for a specific tree topology. Syst. Biol. 20, 406–416 (1971).

    Article  Google Scholar 

  85. Hartigan, J. A. Minimum mutation fits to a given tree. Biometrics https://doi.org/10.2307/2529676 (1973).

    Article  Google Scholar 

  86. Felsenstein, J. Parsimony in systematics: biological and statistical issues. Annu. Rev. Ecol. Syst. 14, 313–333 (1983).

    Article  Google Scholar 

  87. Felsenstein, J. Cases in which parsimony or compatibility methods will be positively misleading. Syst. Biol. 27, 401–410 (1978). Clear explanation and demonstration of the effects of long-branch attraction.

    Article  Google Scholar 

  88. Stuart, A., Arnold, S., Ord, J. K., O’Hagan, A. & Forster, J. Kendall’s advanced theory of statistics (Wiley, 1994).

  89. Felsenstein, J. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17, 368–376 (1981).

    Article  CAS  PubMed  Google Scholar 

  90. Yang, Z. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13, 555–556 (1997).

    CAS  PubMed  Google Scholar 

  91. Guindon, S. et al. PhyML 3.0. Syst. Biol. 59, 307–321 (2010).

    Article  CAS  PubMed  Google Scholar 

  92. Kozlov, A. M. et al. RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference. Bioinformatics https://doi.org/10.1093/bioinformatics/btz305 (2019).

    Article  PubMed  PubMed Central  Google Scholar 

  93. Nguyen, L. T., Schmidt, H. A., Von Haeseler, A. & Minh, B. Q. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol. 32, 268–274 (2015).

    Article  CAS  PubMed  Google Scholar 

  94. Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree 2 - Approximately maximum-likelihood trees for large alignments. PLoS One 5, e9490 (2010).

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  95. Rannala, B. & Yang, Z. Probability distribution of molecular evolutionary trees: a new method of phylogenetic inference. J. Mol. Evol. 43, 304–311 (1996). This article introduces Bayesian methods to phylogenetics.

    Article  CAS  PubMed  Google Scholar 

  96. Li, S., Pearl, D. K. & Doss, H. Phylogenetic tree construction using Markov chain Monte Carlo. J. Am. Stat. Assoc. 95, 493–508 (2000).

    Article  Google Scholar 

  97. Mau, B. & Newton, M. A. Phylogenetic Inference for binary data on dendograms using Markov chain Monte Carlo. J. Comput. Graph. Stat. 6, 122–131 (1997).

    Google Scholar 

  98. Huelsenbeck, J. P. & Ronquist, F. MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics 17, 754–755 (2001).

    Article  CAS  PubMed  Google Scholar 

  99. Höhna, S. et al. RevBayes: Bayesian phylogenetic inference using graphical models and an interactive model-specification language. Syst. Biol. 65, 726–736 (2016).

    Article  PubMed  PubMed Central  Google Scholar 

  100. Suchard, M. A. et al. Bayesian phylogenetic and phylodynamic data integration using BEAST 1.10. Virus Evol. 4, vey016 (2018).

    Article  PubMed  PubMed Central  Google Scholar 

  101. Bouckaert, R. et al. BEAST 2.5: an advanced software platform for Bayesian evolutionary analysis. PLoS Comput. Biol. 15, e1006650 (2019).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  102. Lartillot, N., Lepage, T. & Blanquart, S. PhyloBayes 3: a Bayesian software package for phylogenetic reconstruction and molecular dating. Bioinformatics 25, 2286–2288 (2009). Implementation of the CAT model that accommodates site heterogenous evolution in a Bayesian framework.

    Article  CAS  PubMed  Google Scholar 

  103. Lartillot, N., Rodrigue, N., Stubbs, D. & Richer, J. Phylobayes MPI: phylogenetic reconstruction with infinite mixtures of profiles in a parallel environment. Syst. Biol. 62, 611–615 (2013).

    Article  CAS  PubMed  Google Scholar 

  104. Huelsenbeck, J. P. & Rannala, B. Frequentist properties of Bayesian posterior probabilities of phylogenetic trees under simple and complex substitution models. Syst. Biol. 53, 904–913 (2004).

    Article  PubMed  Google Scholar 

  105. Chen, M.-H., Kuo, L. & Lewis, P. (eds) Bayesian Phylogenetics: Methods, Algorithms, and Applications (Chapman and Hall/CRC, 2014).

  106. Felsenstein, J. Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39, 783 (1985).

    Article  PubMed  Google Scholar 

  107. Susko, E. Bootstrap support is not first-order correct. Syst. Biol. 58, 211–223 (2009).

    Article  PubMed  Google Scholar 

  108. Yang, Z. & Zhu, T. Bayesian selection of misspecified models is overconfident and may cause spurious posterior probabilities for phylogenetic trees. Proc. Natl Acad. Sci. USA 115, 1854–1859 (2018).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  109. Huelsenbeck, J. P. Performance of phylogenetic methods in simulation. Syst. Biol. 44, 17–48 (1995).

    Article  Google Scholar 

  110. Baurain, D., Brinkmann, H. & Philippe, H. Lack of resolution in the animal phylogeny: closely spaced cladogeneses or undetected systematic errors? Mol. Biol. Evol. 24, 6–9 (2007).

    Article  CAS  PubMed  Google Scholar 

  111. Rodréguez-Ezpeleta, N. et al. Detecting and overcoming systematic errors in genome-scale phylogenies. Syst. Biol. 56, 389–399 (2007).

    Article  CAS  Google Scholar 

  112. Brinkmann, H., Van Der Giezen, M., Zhou, Y., Poncelin de Raucourt, G. & Philippe, H. An empirical assessment of long-branch attraction artefacts in deep eukaryotic phylogenomics. Syst. Biol. 54, 743–757 (2005).

    Article  PubMed  Google Scholar 

  113. Rivera-Rivera, C. J. & Montoya-Burgos, J. I. LS3: a method for improving phylogenomic inferences when evolutionary rates are heterogeneous among taxa. Mol. Biol. Evol. 33, 1625–1634 (2016).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  114. Lockhart, P. J., Steel, M. A., Hendy, M. D. & Penny, D. Recovering evolutionary trees under a more realistic model of sequence evolution. Mol. Biol. Evol. 11, 605–612 (1994).

    CAS  PubMed  Google Scholar 

  115. Yang, Z. & Roberts, D. On the use of nucleic acid sequences to infer early branchings in the tree of life. Mol. Biol. Evol. 12, 451–458 (1995).

    CAS  PubMed  Google Scholar 

  116. Foster, P. G. Modeling compositional heterogeneity. Syst. Biol. 53, 485–495 (2004). This article describes a method to detect compositional heterogeneity in sequence alignments.

    Article  PubMed  Google Scholar 

  117. Blanquart, S. & Lartillot, N. A Bayesian compound stochastic process for modeling nonstationary and nonhomogeneous sequence evolution. Mol. Biol. Evol. 23, 2058–2071 (2006).

    Article  CAS  PubMed  Google Scholar 

  118. Nesnidal, M. P., Helmkampf, M., Bruchhaus, I. & Hausdorf, B. Compositional heterogeneity and phylogenomic inference of metazoan relationships. Mol. Biol. Evol. 27, 2095–2104 (2010).

    Article  CAS  PubMed  Google Scholar 

  119. Phillips, M. J. & Penny, D. The root of the mammalian tree inferred from whole mitochondrial genomes. Mol. Phylogenet. Evol. 28, 171–185 (2003).

    Article  CAS  PubMed  Google Scholar 

  120. Susko, E. & Roger, A. J. On reduced amino acid alphabets for phylogenetic inference. Mol. Biol. Evol. 24, 2139–2150 (2007).

    Article  CAS  PubMed  Google Scholar 

  121. Yang, Z. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol. 39, 306–314 (1994).

    Article  CAS  PubMed  Google Scholar 

  122. Yang, Z. Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol. Biol. Evol. 10, 1396–1401 (1993). This article introduces the gamma distribution to model rate heterogeneity across sites.

    CAS  PubMed  Google Scholar 

  123. Yang, Z. A space-time process model for the evolution of DNA sequences. Genetics 139, 993–1005 (1995).

    CAS  PubMed  PubMed Central  Google Scholar 

  124. Kalyaanamoorthy, S., Minh, B. Q., Wong, T. K. F., Von Haeseler, A. & Jermiin, L. S. ModelFinder: fast model selection for accurate phylogenetic estimates. Nat. Methods 14, 587–589 (2017).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  125. Mayrose, I., Friedman, N. & Pupko, T. A gamma mixture model better accounts for among site rate heterogeneity. Bioinformatics 21, 151–158 (2005).

    Article  Google Scholar 

  126. Fitch, W. M. & Markowitz, E. An improved method for determining codon variability in a gene and its application to the rate of fixation of mutations in evolution. Biochem. Genet. 4, 579–593 (1970).

    Article  CAS  PubMed  Google Scholar 

  127. Philippe, H. & Lopez, P. On the conservation of protein sequences in evolution. Trends Biochem. Sci. 26, 414–416 (2001).

    Article  CAS  PubMed  Google Scholar 

  128. Lopez, P., Casane, D. & Philippe, H. Heterotachy, an important process of protein evolution. Mol. Biol. Evol. 19, 1–7 (2002). This article introduces the process of heterotachy and effects on tree reconstruction.

    Article  CAS  PubMed  Google Scholar 

  129. Zhou, Y., Rodrigue, N., Lartillot, N. & Philippe, H. Evaluation of the models handling heterotachy in phylogenetic inference. BMC Evol. Biol. 7, 206 (2007).

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  130. Kimura, M. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16, 111–120 (1980).

    Article  CAS  PubMed  Google Scholar 

  131. Yang, Z., Nielsen, R. & Hasegawa, M. Models of amino acid substitution and applications to mitochondrial protein evolution. Mol. Biol. Evol. 15, 1600–1611 (1998).

    Article  CAS  PubMed  Google Scholar 

  132. Dayhoff, M. O., Schwartz, R. M. & Orcutt, B. C. in Atlas of Protein Sequence and Structure (ed. Dayhoff, M. O.) 345–352 (National Biomedical Research Foundation, 1978).

  133. Jones, D. T., Taylor, W. R. & Thornton, J. M. The rapid generation of mutation data matrices from protein sequences. Bioinformatics 8, 275–282 (1992).

    Article  CAS  Google Scholar 

  134. Whelan, S. & Goldman, N. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol. Biol. Evol. 18, 691–699 (2001).

    Article  CAS  PubMed  Google Scholar 

  135. Le, S. Q. & Gascuel, O. An improved general amino acid replacement matrix. Mol. Biol. Evol. 25, 1307–1320 (2008).

    Article  CAS  PubMed  Google Scholar 

  136. Dang, C. C., Le, S. Q., Gascuel, O. & Le, V. S. FLU, an amino acid substitution model for influenza proteins. BMC Evol. Biol. 10, 99 (2010).

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  137. Adachi, J., Waddell, P. J., Martin, W. & Hasegawa, M. Plastid genome phylogeny and a model of amino acid substitution for proteins encoded by chloroplast DNA. J. Mol. Evol. 50, 348–358 (2000).

    Article  CAS  PubMed  Google Scholar 

  138. Rota-Stabelli, O., Yang, Z. & Telford, M. J. MtZoa: a general mitochondrial amino acid substitutions model for animal evolutionary studies. Mol. Phylogenet. Evol. 52, 268–272 (2009).

    Article  CAS  PubMed  Google Scholar 

  139. Yang, Z. Maximum-likelihood models for combined analyses of multiple sequence data. J. Mol. Evol. 42, 587–596 (1996).

    Article  CAS  PubMed  Google Scholar 

  140. Darriba, D. et al. ModelTest-NG: a new and scalable tool for the selection of DNA and protein evolutionary models. Mol. Biol. Evol. 37, 291–294 (2020).

    Article  PubMed  Google Scholar 

  141. Morel, B., Kozlov, A. M. & Stamatakis, A. ParGenes: A tool for massively parallel model selection and phylogenetic tree inference on thousands of genes. Bioinformatics 35, 1771–1773 (2019).

    Article  CAS  PubMed  Google Scholar 

  142. Hoff, M., Orf, S., Riehm, B., Darriba, D. & Stamatakis, A. Does the choice of nucleotide substitution models matter topologically? BMC Bioinformatics 17, 143 (2016).

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  143. Kainer, D. & Lanfear, R. The effects of partitioning on phylogenetic inference. Mol. Biol. Evol. 32, 1611–1627 (2015).

    Article  CAS  PubMed  Google Scholar 

  144. Darriba, D. & Posada, D. The impact of partitioning on phylogenomic accuracy. bioRxiv https://doi.org/10.1101/023978 (2015).

    Article  Google Scholar 

  145. Goldman, N., Thorne, J. L. & Jones, D. T. Assessing the impact of secondary structure and solvent accessibility on protein evolution. Genetics 149, 445–458 (1998).

    CAS  PubMed  PubMed Central  Google Scholar 

  146. Le, S. Q., Dang, C. C. & Gascuel, O. Modeling protein evolution with several amino acid replacement matrices depending on site rates. Mol. Biol. Evol. 29, 2921–2936 (2012).

    Article  CAS  PubMed  Google Scholar 

  147. Le, S. Q. & Gascuel, O. Accounting for solvent accessibility and secondary structure in protein phylogenetics is clearly beneficial. Syst. Biol. 59, 277–287 (2010).

  148. Le, S. Q., Gascuel, O. & Lartillot, N. Empirical profile mixture models for phylogenetic reconstruction. Bioinformatics 24, 2317–2323 (2008).

  149. Wang, H. C., Li, K., Susko, E. & Roger, A. J. A class frequency mixture model that adjusts for site-specific amino acid frequencies and improves inference of protein phylogeny. BMC Evol. Biol. 8, 331 (2008).

  150. Halpern, A. L. & Bruno, W. J. Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies. Mol. Biol. Evol. 15, 910–917 (1998).

  151. Lartillot, N. & Philippe, H. A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol. Biol. Evol. 21, 1095–1109 (2004). This article introduces the CAT model to accommodate site heterogeneity.

  152. Wang, H. C., Minh, B. Q., Susko, E. & Roger, A. J. Modeling site heterogeneity with posterior mean site frequency profiles accelerates accurate phylogenomic estimation. Syst. Biol. 67, 216–235 (2018). This article discusses approximate site heterogeneous models for maximum likelihood framework applicable to large datasets.

  153. Susko, E., Lincker, L. & Roger, A. J. Accelerated estimation of frequency classes in site-heterogeneous profile mixture models. Mol. Biol. Evol. 35, 1266–1283 (2018).

  154. Lartillot, N., Brinkmann, H. & Philippe, H. Suppression of long-branch attraction artefacts in the animal phylogeny using a site-heterogeneous model. BMC Evol. Biol. 7, S4 (2007).

  155. Maddison, W. P. Gene trees in species trees. Syst. Biol. 46, 523–536 (1997).

  156. Nichols, R. Gene trees and species trees are not the same. Trends Ecol. Evol. 16, 358–364 (2001).

  157. Edwards, S. V. Is a new and general theory of molecular systematics emerging? Evolution 63, 1–19 (2009).

  158. Rannala, B. & Yang, Z. Bayes estimation of species divergence times and ancestral population sizes using DNA sequences from multiple loci. Genetics 164, 1645–1656 (2003). This article introduces the multi-species coalescent model in a Bayesian framework.

  159. Degnan, J. H. & Rosenberg, N. A. Gene tree discordance, phylogenetic inference and the multispecies coalescent. Trends Ecol. Evol. 24, 332–340 (2009).

  160. Kingman, J. F. C. The coalescent. Stoch. Process. Their Appl. 13, 235–248 (1982).

  161. Xu, B. & Yang, Z. Challenges in species tree estimation under the multispecies coalescent model. Genetics 204, 1353–1368 (2016).

  162. Hey, J. Isolation with migration models for more than two populations. Mol. Biol. Evol. 27, 905–920 (2010).

  163. Hey, J. et al. Phylogeny estimation by integration over isolation with migration models. Mol. Biol. Evol. 35, 2805–2818 (2018).

  164. Dalquen, D. A., Zhu, T. & Yang, Z. Maximum likelihood implementation of an isolation-with-migration model for three species. Syst. Biol. 66, 379–398 (2017).

  165. Wen, D. & Nakhleh, L. Coestimating reticulate phylogenies and gene trees from multilocus sequence data. Syst. Biol. 67, 439–457 (2018).

  166. Zhang, C., Ogilvie, H. A., Drummond, A. J. & Stadler, T. Bayesian inference of species networks from multilocus sequence data. Mol. Biol. Evol. 35, 504–517 (2018).

  167. Flouri, T., Jiao, X., Rannala, B. & Yang, Z. A Bayesian implementation of the multispecies coalescent model with introgression for phylogenomic analysis. Mol. Biol. Evol. 37, 1211–1223 (2020).

  168. Kubatko, L. in Handbook of Statistical Genomics (eds Balding, D., Moltke, I. & Marioni, J.) 219–245 (Wiley, 2019).

  169. Rannala, B., Edwards, S., Leaché, A. D. & Yang, Z. in Phylogenetics in the Genomic Era (eds Scornavacca, C., Delsuc, F. & Galtier, N.) 3.3:1–3.3:21 (2020).

  170. Mirarab, S. et al. ASTRAL: genome-scale coalescent-based species tree estimation. Bioinformatics 30, i541–i548 (2014).

  171. Liu, L., Yu, L. & Edwards, S. V. A maximum pseudo-likelihood approach for estimating species trees under the coalescent model. BMC Evol. Biol. 10, 302 (2010).

  172. Ogilvie, H. A., Bouckaert, R. R. & Drummond, A. J. StarBEAST2 brings faster species tree inference and accurate estimates of substitution rates. Mol. Biol. Evol. 34, 2101–2114 (2017).

  173. Heled, J. & Drummond, A. J. Bayesian inference of species trees from multilocus data. Mol. Biol. Evol. 27, 570–580 (2010).

  174. Yang, Z. & Rannala, B. Unguided species delimitation using DNA sequence data from multiple loci. Mol. Biol. Evol. 31, 3125–3135 (2014).

  175. Flouri, T., Jiao, X., Rannala, B. & Yang, Z. Species tree inference with BPP using genomic sequences and the multispecies coalescent. Mol. Biol. Evol. 35, 2585–2593 (2018).

  176. Nascimento, F. F., Reis, M. D. & Yang, Z. A biologist’s guide to Bayesian phylogenetic analysis. Nat. Ecol. Evol. 1, 1446–1454 (2017).

  177. Thawornwattana, Y., Dalquen, D. & Yang, Z. Coalescent analysis of phylogenomic data confidently resolves the species relationships in the Anopheles gambiae species complex. Mol. Biol. Evol. 35, 2512–2527 (2018).

  178. Shi, C. M. & Yang, Z. Coalescent-based analyses of genomic sequence data provide a robust resolution of phylogenetic relationships among major groups of gibbons. Mol. Biol. Evol. 35, 159–179 (2018).

  179. Mirarab, S., Bayzid, M. S. & Warnow, T. Evaluating summary methods for multilocus species tree estimation in the presence of incomplete lineage sorting. Syst. Biol. 65, 366–380 (2016).

  180. Morgan, C. C. et al. Heterogeneous models place the root of the placental mammal phylogeny. Mol. Biol. Evol. 30, 2145–2156 (2013).

  181. Zhou, Z. & Zhang, J. Amino acid exchangeabilities vary across the tree of life. Sci. Adv. 5, eaax3124 (2019).

  182. Roch, S., Nute, M. & Warnow, T. Long-branch attraction in species tree estimation: inconsistency of partitioned likelihood and topology-based summary methods. Syst. Biol. 68, 281–297 (2019).

  183. Kobert, K., Stamatakis, A. & Flouri, T. Efficient detection of repeating sites to accelerate phylogenetic likelihood calculations. Syst. Biol. 66, 205–217 (2017). This article introduces novel methods for substantially improving the computational time of the phylogenetic likelihood function and reducing its memory footprint.

  184. Kobert, K., Flouri, T., Aberer, A. & Stamatakis, A. in Algorithms in Bioinformatics. WABI 2014. Lecture Notes in Computer Science (eds Brown, D. & Morgenstern, B.) 204–216 https://doi.org/10.1007/978-3-662-44753-6_16 (Springer, 2014).

  185. Aberer, A. J., Kobert, K. & Stamatakis, A. ExaBayes: massively parallel Bayesian tree inference for the whole-genome era. Mol. Biol. Evol. 31, 2553–2556 (2014).

  186. Flouri, T. et al. The phylogenetic likelihood library. Syst. Biol. 64, 356–362 (2015).

  187. Ayres, D. L. et al. BEAGLE 3: improved performance, scaling, and usability for a high-performance computing library for statistical phylogenetics. Syst. Biol. 68, 1052–1061 (2019).

  188. Rannala, B. & Yang, Z. Efficient Bayesian species tree inference under the multispecies coalescent. Syst. Biol. 66, 823–842 (2017).

  189. Höhna, S. & Drummond, A. J. Guided tree topology proposals for Bayesian phylogenetic inference. Syst. Biol. 61, 1–11 (2012).

  190. Baele, G., Lemey, P., Rambaut, A. & Suchard, M. A. Adaptive MCMC in Bayesian phylogenetics: an application to analyzing partitioned data in BEAST. Bioinformatics 33, 1798–1805 (2017).

Acknowledgements

The authors thank members of the Z. Yang and M. Telford laboratories as well as the three reviewers for valuable feedback on previous versions of the manuscript. The writing of this Review was supported by BBSRC grant reference BB/R016240/1.

Author information

Contributions

K.P., Z.Y. and M.J.T. contributed to all aspects of the article.

Corresponding author

Correspondence to Maximilian J. Telford.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information

Nature Reviews Genetics thanks F. Ronquist, M. Suchard and A. von Haeseler for their contribution to the peer review of this work.

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Glossary

Homologous

Describes features, including morphological characters and gene loci, that are inherited from a common ancestor; for example, a gene in two species that originates from a single ancestral gene.

Orthologous

Homologous sequences that have diverged due to speciation events.

Substitution models

Continuous-time Markov chain models that describe changes between nucleotides or amino acids over evolutionary time.
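
As a minimal sketch of such a model, the Jukes-Cantor (JC69) model assumes equal rates of change between all four nucleotides, which gives closed-form transition probabilities. The choice of JC69 and the function name below are illustrative assumptions and are not taken from this Review.

```python
import math


def jc69_transition_probability(same_state: bool, distance: float) -> float:
    """Return the JC69 probability of observing the same nucleotide
    (same_state=True) or one specific different nucleotide (same_state=False)
    after `distance` expected substitutions per site."""
    decay = math.exp(-4.0 * distance / 3.0)
    if same_state:
        return 0.25 + 0.75 * decay
    return 0.25 - 0.25 * decay


# Each row of the transition matrix sums to 1: one "same" entry plus
# three "different" entries.
p_same = jc69_transition_probability(True, 0.1)
p_diff = jc69_transition_probability(False, 0.1)
row_sum = p_same + 3 * p_diff
```

At distance zero no change is possible, and at very large distances every nucleotide becomes equally likely (probability 0.25), which is the stationary distribution of this model.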

Species tree

A phylogenetic tree for a set of species that underlies the gene trees at individual loci.

Paralogy

The relationship between homologous sequences that have diverged due to duplication events, so that both copies have descended side by side during the history of an organism.

Xenology

The relationship between homologous sequences that originate from horizontal gene transfer (also known as lateral gene transfer).

Alignment

Insertion of gaps in homologous sequences so that nucleotides or amino acids in the same column are homologous.

Gene tree

The phylogenetic or genealogical tree of sequences at a gene locus or genomic region.

Systematic errors

Errors due to incorrect model assumptions.

Incomplete lineage sorting

Discordance of gene trees from the species tree due to ancestral polymorphism.

Topology

The branching pattern of a phylogenetic tree indicating relationships between taxa.

Long-branch attraction (LBA)

The phenomenon of inferring an incorrect tree in which taxa with long branches are grouped together.

Clades

A clade is a group of taxa on a tree that includes their most recent common ancestor and all its descendants, also known as a monophyletic group.

Stochastic errors

Errors due to the finite length of sequences in the alignment.

Homogeneous-process model

A model that assumes the same substitution rate or process across alignment sites, taxa and time.

Compositional homogeneity

Homogeneity in nucleotide or amino acid frequencies across lineages of a phylogeny.

Mixture models

Models that assume different substitution rates or processes across sites of the alignment.

Profile mixture models

Models that assume multiple sets of state frequencies for sites (for example, CAT, C10–C60).

Coalescence

The process of lineage joining when one traces the history of a sample of sequences backwards in time.
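
The lineage-joining process can be sketched with a small simulation, assuming the standard Kingman coalescent: while k lineages remain, the waiting time to the next coalescence is exponential with rate k(k-1)/2, with time measured in coalescent units. The function names are hypothetical and the code is an illustration rather than a method described in this Review.

```python
import random


def simulate_tmrca(n_samples: int, rng: random.Random) -> float:
    """Simulate the time to the most recent common ancestor of
    n_samples sequences, tracing lineages backwards in time."""
    total_time = 0.0
    lineages = n_samples
    while lineages > 1:
        # Any of the k(k-1)/2 pairs of lineages may coalesce next.
        rate = lineages * (lineages - 1) / 2.0
        total_time += rng.expovariate(rate)
        lineages -= 1  # two lineages join into one
    return total_time


def expected_tmrca(n_samples: int) -> float:
    """Expected time to the most recent common ancestor: 2 * (1 - 1/n)."""
    return 2.0 * (1.0 - 1.0 / n_samples)


rng = random.Random(1)
replicates = [simulate_tmrca(5, rng) for _ in range(20000)]
mean_tmrca = sum(replicates) / len(replicates)
```

Under these assumptions the simulated mean should be close to the value returned by `expected_tmrca(5)`.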

Genetic drift

The process of random changes in allele frequencies over generations due to the stochastic nature of reproduction.
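
A minimal sketch of drift, assuming the standard Wright-Fisher model with a fixed number of gene copies and no selection or mutation: each generation, every copy picks its parent at random from the previous generation. The function name and parameters are hypothetical.

```python
import random


def wright_fisher(n_copies: int, start_count: int, generations: int,
                  rng: random.Random) -> list:
    """Return the allele frequency in each generation, starting from
    start_count copies of the allele among n_copies gene copies."""
    count = start_count
    frequencies = [count / n_copies]
    for _ in range(generations):
        p = count / n_copies
        # Binomial sampling: each copy inherits the allele with probability p.
        count = sum(1 for _ in range(n_copies) if rng.random() < p)
        frequencies.append(count / n_copies)
    return frequencies


rng = random.Random(42)
trajectory = wright_fisher(n_copies=100, start_count=50, generations=200, rng=rng)
lost = wright_fisher(n_copies=100, start_count=0, generations=10, rng=rng)
fixed = wright_fisher(n_copies=100, start_count=100, generations=10, rng=rng)
```

Frequencies 0 and 1 are absorbing in this sketch: once an allele is lost or fixed, sampling alone cannot change its frequency again.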

About this article

Cite this article

Kapli, P., Yang, Z. & Telford, M.J. Phylogenetic tree building in the genomic age. Nat Rev Genet 21, 428–444 (2020). https://doi.org/10.1038/s41576-020-0233-0

  • DOI: https://doi.org/10.1038/s41576-020-0233-0
