CherryML: scalable maximum likelihood estimation of phylogenetic models

Abstract

Phylogenetic models of molecular evolution are central to numerous biological applications spanning diverse timescales, from hundreds of millions of years involving orthologous proteins to just tens of days relating to single cells within an organism. A fundamental problem in these applications is estimating model parameters, for which maximum likelihood estimation is typically employed. Unfortunately, maximum likelihood estimation is a computationally expensive task, in some cases prohibitively so. To address this challenge, we here introduce CherryML, a broadly applicable method that achieves several orders of magnitude speedup by using a quantized composite likelihood over cherries in the trees. The massive speedup offered by our method should enable researchers to consider more complex and biologically realistic models than previously possible. Here we demonstrate CherryML’s utility by applying it to estimate a general 400 × 400 rate matrix for residue–residue coevolution at contact sites in three-dimensional protein structures; we estimate that using current state-of-the-art methods such as the expectation-maximization algorithm for the same task would take >100,000 times longer.
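The idea of a quantized composite likelihood over cherries can be illustrated with a minimal sketch. It assumes each cherry is a pair of leaf states separated by a known time, that times are snapped to the nearest point on a small grid (nearest in log space), and that the composite log-likelihood sums log transition probabilities over cherries. The helper names (`mat_exp`, `quantize`, `count_matrices`, `composite_log_likelihood`), the grid and the toy two-state rate matrix are hypothetical and are not taken from the CherryML package.

```python
import math

def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(q, t, terms=20, squarings=10):
    """Approximate exp(q * t) with a truncated Taylor series plus scaling and squaring."""
    n = len(q)
    scale = t / (2 ** squarings)
    a = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def quantize(t, grid):
    """Snap a positive time t to the closest grid point in log space (assumed scheme)."""
    return min(grid, key=lambda g: abs(math.log(g) - math.log(t)))

def count_matrices(cherries, grid, n_states):
    """One count matrix per grid time; each cherry (x, y, t) is counted in both directions."""
    counts = {g: [[0] * n_states for _ in range(n_states)] for g in grid}
    for x, y, t in cherries:
        g = quantize(t, grid)
        counts[g][x][y] += 1
        counts[g][y][x] += 1
    return counts

def composite_log_likelihood(q, counts):
    """Sum of counts * log transition probability; one matrix exponential per grid time."""
    total = 0.0
    for g, c in counts.items():
        p = mat_exp(q, g)
        for x, row in enumerate(c):
            for y, n_xy in enumerate(row):
                if n_xy:
                    total += n_xy * math.log(p[x][y])
    return total

# Toy two-state example (hypothetical data).
q = [[-1.0, 1.0], [1.0, -1.0]]
grid = [0.1, 1.0]
cherries = [(0, 0, 0.11), (0, 1, 0.9), (1, 1, 0.1)]
counts = count_matrices(cherries, grid, n_states=2)
log_lik = composite_log_likelihood(q, counts)
```

In this sketch the cost of evaluating the objective depends on the number of grid points rather than on the number of cherries, because the data enter only through the count matrices.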

Fig. 1: CherryML method applied to the LG model.
Fig. 2: CherryML method applied to learn a 400 × 400 coevolution model.

Data availability

The LG paper5 training and testing Pfam datasets consisting of 3,912 and 500 families, respectively, are available at http://www.atgc-montpellier.fr/models/index.php?model=lg. The Pfam dataset with structure data from Yang et al.19 consisting of 15,051 families is located at https://files.ipd.uw.edu/pub/trRosetta/training_set.tar.gz. The QMaker9 datasets are available at https://figshare.com/articles/dataset/QMaker-datasets_zip/9768101. Our simulated datasets used for Figs. 1b–d and 2a,b are available on Zenodo at https://zenodo.org/record/7830072#.ZDnPBuzMKTc. Instructions for how to reproduce all results in this paper using the above datasets can be found at: https://github.com/songlab-cal/CherryML.

Code availability

Code for reproducing all results in this paper, as well as code implementing the CherryML method for the LG model and for the coevolution model, is available on GitHub at the repository: https://github.com/songlab-cal/CherryML. The CherryML package allows seamless estimation of rate matrices from MSAs. An end-to-end demonstration on the plant dataset28 with train/test splits from the QMaker work9 is provided in the package’s README. A Code Ocean capsule of the CherryML package is provided at: https://codeocean.com/capsule/1152557/tree.

References

  1. Dayhoff, M. O. & Schwartz, R. M. A model of evolutionary changes in protein. In Atlas of Protein Sequence and Structure, Ch. 22, 345–352 (National Biomedical Research Foundation, 1978).

  2. Jones, D. T., Taylor, W. R. & Thornton, J. M. The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci. 8, 275–282 (1992).

  3. Whelan, S. & Goldman, N. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol. Biol. Evol. 18, 691–699 (2001).

  4. Yang, Z. PAML 4: phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. 24, 1586–1591 (2007).

  5. Le, S. Q. & Gascuel, O. An improved general amino acid replacement matrix. Mol. Biol. Evol. 25, 1307–1320 (2008).

  6. Guindon, S. et al. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst. Biol. 59, 307–321 (2010).

  7. Bouckaert, R. et al. BEAST 2.5: an advanced software platform for Bayesian evolutionary analysis. PLoS Comput. Biol. 15, e1006650 (2019).

  8. Minh, B. Q. et al. IQ-TREE 2: new models and efficient methods for phylogenetic inference in the genomic era. Mol. Biol. Evol. 37, 1530–1534 (2020).

  9. Minh, B. Q., Dang, C. C., Vinh, L. S. & Lanfear, R. QMaker: fast and accurate method to estimate empirical models of protein evolution. Syst. Biol. 70, 1046–1060 (2021).

  10. Yang, Z. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol. 39, 306–314 (1994).

  11. Kalyaanamoorthy, S., Minh, B. Q., Wong, T. K. F., von Haeseler, A. & Jermiin, L. S. ModelFinder: fast model selection for accurate phylogenetic estimates. Nat. Methods 14, 587–589 (2017).

  12. Holmes, I. A model of indel evolution by finite-state, continuous-time machines. Genetics 216, 1187–1204 (2020).

  13. Yeang, C.-H. & Haussler, D. Detecting coevolution in and among protein domains. PLoS Comput. Biol. 3, 1–13 (2007).

  14. Felsenstein, J. Maximum likelihood and minimum-steps methods for estimating evolutionary trees from data on discrete characters. Syst. Biol. 22, 240–249 (1973).

  15. Siepel, A. & Haussler, D. Phylogenetic estimation of context-dependent substitution rates by maximum likelihood. Mol. Biol. Evol. 21, 468–488 (2004).

  16. Klosterman, P. S. et al. XRATE: a fast prototyping, training and annotation tool for phylo-grammars. BMC Bioinform. 7, 428 (2006).

  17. Varin, C., Reid, N. & Firth, D. An overview of composite likelihood methods. Stat. Sin. 21, 5–42 (2011).

  18. Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, 8026–8037 (NeurIPS, 2019).

  19. Yang, J. et al. Improved protein structure prediction using predicted interresidue orientations. Proc. Natl Acad. Sci. USA 117, 1496–1503 (2020).

  20. Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree 2: approximately maximum-likelihood trees for large alignments. PLoS ONE 5, e9490 (2010).

  21. Franzosa, E. A. & Xia, Y. Structural determinants of protein evolution are context-sensitive at the residue level. Mol. Biol. Evol. 26, 2387–2395 (2009).

  22. Echave, J., Spielman, S. J. & Wilke, C. O. Causes of evolutionary rate variation among protein sites. Nat. Rev. Genet. 17, 109–121 (2016).

  23. Dang, C., Vinh, L., Gascuel, O., Hazes, B. & Le, Q. FastMG: a simple, fast, and accurate maximum likelihood procedure to estimate amino acid replacement rate matrices from large data sets. BMC Bioinform. 15, 341 (2014).

  24. Canh, N. D., Cao Dang, C., Vinh, L. S., Quang Minh, B. & Hoang, D. T. pQMaker: empirically estimating amino acid substitution models in a parallel environment. In 2020 12th International Conference on Knowledge and Systems Engineering (KSE), 324–329 (2020).

  25. Jumper, J. M. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).

  26. Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR) (2015).

  27. Bader, P., Blanes, S. & Casas, F. Computing the matrix exponential with an optimized Taylor polynomial approximation. Mathematics 7, 1174 (2019).

  28. Ran, J., Shen, T.-T., Wang, M.-M. & Wang, X.-Q. Phylogenomics resolves the deep phylogeny of seed plants and indicates partial convergent or homoplastic evolution between Gnetales and angiosperms. Proc. R. Soc. B Biol. Sci. 285, 20181012 (2018).

Acknowledgements

S.P. acknowledges A. Umfurer for many helpful discussions on software design. We acknowledge the reviewers for their helpful feedback. We thank O. Gascuel for suggesting that we pair up more sequences beyond the original cherries in the trees, which empirically boosted the statistical efficiency of the method by around 10–30% (estimated). We also thank I. Holmes, J. Huelsenbeck, N. Thomas and W. DeWitt for helpful discussions. This research is supported in part by a National Institutes of Health grant R35-GM134922 (to Y.S.S.). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

Author information

Contributions

S.P. and Y.S.S. conceived and designed the study. S.P. developed, implemented and tested the method, with assistance from Y.D., P.B., X.L. and P.-Y.C. P.B. contributed to implementing optimization with PyTorch, while X.L. and P.-Y.C. contributed to parallelizing and optimizing the computation of count matrices and the simulations. Y.D. helped with testing the method. S.P. and Y.S.S. analyzed the coevolution model. S.P. and Y.S.S. wrote the manuscript and Y.D. made edits. Y.S.S. supervised the project.

Corresponding author

Correspondence to Yun S. Song.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks Olivier Gascuel, Alexandros Stamatakis and the other, anonymous, reviewer for their contribution to the peer review of this work. Primary Handling Editor: Arunima Singh, in collaboration with the Nature Methods team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Plot of true versus estimated rate matrix entries for Fig. 1b,c.

For a select number of families (the multiples of 4), we plot the true versus estimated rate matrix entries. MRE stands for median relative error; ρ is Spearman’s rank correlation; r is Pearson correlation. For reference, we also indicate the total number of sequences, sites and residues in each dataset. As more data become available, estimation accuracy increases for both methods. Importantly, the loss of statistical efficiency of CherryML with respect to EM is relatively small (an estimated ≈ 50% as seen in Fig. 1c). Interestingly, with small dataset sizes, the smallest transition rates tend to be underestimated by both methods, possibly because no (or relatively too few) transitions between these states are observed.
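The three summary statistics named in this caption can be computed as in the following sketch. It assumes that relative error means |estimate − truth| / truth for each rate matrix entry and that ties are given average ranks in the Spearman correlation; the function names and the toy numbers are illustrative only.

```python
import math
from statistics import median

def median_relative_error(true, est):
    """Median of |estimate - truth| / truth over paired entries (assumed definition)."""
    return median(abs(e - t) / t for t, e in zip(true, est))

def pearson(x, y):
    """Pearson correlation coefficient r."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Toy rates (hypothetical): every estimate is off by 10% but the ordering is preserved.
true_rates = [1.0, 2.0, 4.0, 8.0]
est_rates = [1.1, 1.8, 4.4, 7.2]
mre = median_relative_error(true_rates, est_rates)
rho = spearman(true_rates, est_rates)
r = pearson(true_rates, est_rates)
```

Because Spearman's ρ depends only on ordering, it stays at its maximum in this toy case even though every entry carries a 10% error, which is why the caption reports it alongside MRE.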

Extended Data Fig. 2 Plot of rate matrix estimates from Fig. 1e.

We see that the entries of the LG rate matrix5 and re-estimates with CherryML and EM are quite similar. The only noticeable differences correspond to four of the rarest (and thus harder to estimate) rates, which are between C and E, and C and K (in both directions). It is possible that CherryML is underestimating these rates, but since there is no ground truth and the model may be misspecified (as these are real data estimates), we cannot say with certainty; bootstrap estimates of variance could partially answer this question. For reference, the dataset size in terms of number of families, sequences, sites and residues is: 3,912, ≈ 50,000, ≈ 600,000 and ≈ 6.5M respectively. In principle, this is roughly comparable in size to 256 families in the (well-specified) simulations from Fig. 1b,c, where both EM and CherryML are accurate even for small rates, as seen in Extended Data Fig. 1. However, these direct comparisons of dataset size might overestimate the information content of real datasets. Indeed, it is possible that the effective amount of information for these small rates is more comparable to 16 families in the simulations from Fig. 1b,c, where CherryML produces more underestimates than does EM.
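The bootstrap variance estimate mentioned above could be set up along the following lines. This is a generic sketch that assumes families are resampled with replacement and the estimator is rerun on each replicate; `toy_rate_estimator` is a hypothetical stand-in (transitions divided by total branch length), not the CherryML or EM estimator, and the data are invented.

```python
import random
from statistics import variance

def bootstrap_variance(families, estimator, n_replicates=200, seed=0):
    """Variance of an estimator across bootstrap resamples of whole families."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_replicates):
        # Resample families with replacement, keeping the dataset size fixed.
        resample = rng.choices(families, k=len(families))
        estimates.append(estimator(resample))
    return variance(estimates)

def toy_rate_estimator(families):
    """Hypothetical estimator: observed transitions per unit of branch length."""
    transitions = sum(n for n, _ in families)
    length = sum(t for _, t in families)
    return transitions / length

# Each family is (number of observed transitions, total branch length); toy values.
families = [(3, 2.0), (1, 1.5), (0, 0.5), (4, 2.5), (2, 2.0)]
var_hat = bootstrap_variance(families, toy_rate_estimator)
```

A large bootstrap variance for a particular rate relative to its estimate would suggest that disagreement between methods on that rate is within sampling noise; bias from model misspecification would not be detected this way.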

Extended Data Fig. 3 CherryML matches EM accuracy on diverse datasets.

On diverse datasets from the QMaker paper9, CherryML matches the accuracy of the EM method. The end-to-end runtime of each approach (including tree estimation) is shown. The runtime of the CherryML optimizer was negligible in all cases (less than 5 minutes), so end-to-end runtime was dominated by phylogeny reconstruction with FastTree, which took a few CPU hours depending on the dataset. In contrast, for the EM approach, the EM optimizer dominated runtime, leading to an overall slowdown of 5- to 20-fold in end-to-end runtime compared to the CherryML approach. Since tree estimation is embarrassingly parallel, end-to-end estimation with the CherryML method using 32 CPU cores takes only a few minutes on all of these datasets. The diversity of the datasets means that LG is no longer the best-fit rate matrix compared to JTT and WAG. In fact, JTT is preferred in three of these datasets. This highlights the need to estimate new rate matrices for improved phylogenetic inference in specific applications9. Training dataset sizes are included for reference.

Extended Data Fig. 4 Plot of rate matrix estimates from Extended Data Fig. 3.

Similarly to Extended Data Fig. 2, CherryML and EM agree on most rates, except for some of the smallest (harder to estimate) rates, where CherryML usually reports smaller rates. It is possible that these are underestimates from CherryML, for instance if the information content for these small rates is similar to 16 families in Extended Data Fig. 1, where CherryML produces more underestimates compared to EM.

Extended Data Fig. 5 Comparison of mutation rates.

(a) Our 400 × 400 coevolution model. (b) Independent-sites model.

Extended Data Fig. 6 Comparison of stationary distributions.

(a) Our 400 × 400 coevolution model. (b) Independent-sites model.
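To make the independent-sites baseline in Extended Data Figs. 5 and 6 concrete, the sketch below assumes it is obtained by letting two sites evolve independently under one single-site rate matrix Q, which gives the pair rate matrix as a Kronecker sum and the pair stationary distribution as a product. With 20 amino acid states this construction yields a 400 × 400 matrix. The two-state Q and the function names are illustrative assumptions.

```python
def independent_pair_rate_matrix(q):
    """Kronecker sum Q (+) Q: rate matrix on pairs (a, b) when the sites evolve independently."""
    n = len(q)
    big = [[0.0] * (n * n) for _ in range(n * n)]
    for a in range(n):
        for b in range(n):
            row = a * n + b
            for c in range(n):
                big[row][c * n + b] += q[a][c]  # first site changes a -> c
                big[row][a * n + c] += q[b][c]  # second site changes b -> c
    return big

def product_distribution(pi):
    """Stationary distribution on pairs under independence: pi[a] * pi[b]."""
    return [pa * pb for pa in pi for pb in pi]

def left_multiply(vec, mat):
    """Row vector times matrix."""
    return [sum(vec[i] * mat[i][j] for i in range(len(vec))) for j in range(len(mat[0]))]

# Toy two-state single-site model; its stationary distribution is (2/3, 1/3).
q = [[-1.0, 1.0], [2.0, -2.0]]
pi = [2.0 / 3.0, 1.0 / 3.0]

q_pair = independent_pair_rate_matrix(q)
pi_pair = product_distribution(pi)
residual = left_multiply(pi_pair, q_pair)  # close to zero when pi_pair is stationary
```

Under this construction, entries in which both sites change at once, such as (0, 0) to (1, 1), have rate exactly zero, whereas a general 400 × 400 rate matrix is free to assign nonzero rates to them.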

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Prillo, S., Deng, Y., Boyeau, P. et al. CherryML: scalable maximum likelihood estimation of phylogenetic models. Nat Methods 20, 1232–1236 (2023). https://doi.org/10.1038/s41592-023-01917-9

  • DOI: https://doi.org/10.1038/s41592-023-01917-9
