CherryML: scalable maximum likelihood estimation of phylogenetic models

Prillo, Sebastian; Deng, Yun; Boyeau, Pierre; Li, Xingyu; Chen, Po-Yen; Song, Yun S.

doi:10.1038/s41592-023-01917-9

Article
Published: 29 June 2023

Sebastian Prillo¹, Yun Deng², Pierre Boyeau¹, Xingyu Li¹, Po-Yen Chen¹ & Yun S. Song¹,³ (ORCID: orcid.org/0000-0002-0734-9868)

Nature Methods volume 20, pages 1232–1236 (2023)

Abstract

Phylogenetic models of molecular evolution are central to numerous biological applications spanning diverse timescales, from hundreds of millions of years involving orthologous proteins to just tens of days relating to single cells within an organism. A fundamental problem in these applications is estimating model parameters, for which maximum likelihood estimation is typically employed. Unfortunately, maximum likelihood estimation is a computationally expensive task, in some cases prohibitively so. To address this challenge, we here introduce CherryML, a broadly applicable method that achieves several orders of magnitude speedup by using a quantized composite likelihood over cherries in the trees. The massive speedup offered by our method should enable researchers to consider more complex and biologically realistic models than previously possible. Here we demonstrate CherryML’s utility by applying it to estimate a general 400 × 400 rate matrix for residue–residue coevolution at contact sites in three-dimensional protein structures; we estimate that using current state-of-the-art methods such as the expectation-maximization algorithm for the same task would take >100,000 times longer.
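
The idea of a quantized composite likelihood over cherries can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: the model is time-reversible, so one leaf of a cherry can be treated as the start state and the other as the end state across the path of length t; site rate variation is ignored; and the time grid is geometric. The function names (`mat_exp`, `quantize`, `count_matrices`, `composite_log_likelihood`) and the toy data are hypothetical. Snapping each cherry's distance to a grid point means the data collapse into one count matrix per grid point, so each likelihood evaluation needs only one matrix exponential per grid point rather than one per cherry.

```python
import math

def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(q, t, squarings=10, terms=12):
    """exp(q * t) via scaling and squaring with a truncated Taylor series."""
    n = len(q)
    scale = t / 2 ** squarings
    a = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, a)]  # a^k / k!
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def quantize(t, grid):
    """Index of the grid point closest to t in log space."""
    return min(range(len(grid)),
               key=lambda b: abs(math.log(grid[b]) - math.log(t)))

def count_matrices(cherries, grid, n_states):
    """One transition-count matrix per quantized time point."""
    counts = [[[0] * n_states for _ in range(n_states)] for _ in grid]
    for seq_x, seq_y, t in cherries:
        b = quantize(t, grid)
        for x, y in zip(seq_x, seq_y):
            counts[b][x][y] += 1
    return counts

def composite_log_likelihood(q, counts, grid):
    """Sum over grid points b and states i, j of C_b[i][j] * log exp(q * t_b)[i][j]."""
    total = 0.0
    for t, c in zip(grid, counts):
        if not any(any(row) for row in c):
            continue  # no cherries fell in this bucket
        p = mat_exp(q, t)
        for i, row in enumerate(c):
            for j, n_ij in enumerate(row):
                if n_ij:
                    total += n_ij * math.log(p[i][j])
    return total

# Toy two-state data: each cherry is (leaf sequence, leaf sequence, distance).
grid = [0.1 * 2 ** k for k in range(5)]
cherries = [
    ([0, 0, 1, 1], [0, 0, 1, 1], 0.1),
    ([0, 1, 0, 1], [0, 1, 1, 1], 0.5),
]
counts = count_matrices(cherries, grid, n_states=2)
slow = [[-1.0, 1.0], [1.0, -1.0]]
fast = [[-5.0, 5.0], [5.0, -5.0]]
print(composite_log_likelihood(slow, counts, grid))
print(composite_log_likelihood(fast, counts, grid))
```

In this sketch the likelihood is only evaluated for two candidate rate matrices; an actual estimator would maximize it over the rate matrix with a numerical optimizer.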

**Fig. 1: CherryML method applied to the LG model.**

**Fig. 2: CherryML method applied to learn a 400 × 400 coevolution model.**

Data availability

The LG paper⁵ training and testing Pfam datasets consisting of 3,912 and 500 families, respectively, are available at http://www.atgc-montpellier.fr/models/index.php?model=lg. The Pfam dataset with structure data from Yang et al.¹⁹ consisting of 15,051 families is located at https://files.ipd.uw.edu/pub/trRosetta/training_set.tar.gz. The QMaker⁹ datasets are available at https://figshare.com/articles/dataset/QMaker-datasets_zip/9768101. Our simulated datasets used for Figs. 1b–d and 2a,b are available on Zenodo at https://zenodo.org/record/7830072#.ZDnPBuzMKTc. Instructions for how to reproduce all results in this paper using the above datasets can be found at: https://github.com/songlab-cal/CherryML.

Code availability

Code for reproducing all results in this paper, as well as code implementing the CherryML method for the LG model and for the coevolution model, is available on GitHub at the repository: https://github.com/songlab-cal/CherryML. The CherryML package allows seamless estimation of rate matrices from MSAs. An end-to-end demonstration on the plant dataset²⁸ with train/test splits from the QMaker work⁹ is provided in the package’s README. A Code Ocean capsule of the CherryML package is provided at: https://codeocean.com/capsule/1152557/tree.

References

1. Dayhoff, M. O. & Schwartz, R. M. A model of evolutionary changes in protein. In Atlas of Protein Sequence and Structure, Ch. 22, 345–352 (National Biomedical Research Foundation, 1978).
2. Jones, D. T., Taylor, W. R. & Thornton, J. M. The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci. 8, 275–282 (1992).
3. Whelan, S. & Goldman, N. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol. Biol. Evol. 18, 691–699 (2001).
4. Yang, Z. PAML 4: phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. 24, 1586–1591 (2007).
5. Le, S. Q. & Gascuel, O. An improved general amino acid replacement matrix. Mol. Biol. Evol. 25, 1307–1320 (2008).
6. Guindon, S. et al. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst. Biol. 59, 307–321 (2010).
7. Bouckaert, R. et al. BEAST 2.5: an advanced software platform for Bayesian evolutionary analysis. PLoS Comput. Biol. 15, e1006650 (2019).
8. Minh, B. Q. et al. IQ-TREE 2: new models and efficient methods for phylogenetic inference in the genomic era. Mol. Biol. Evol. 37, 1530–1534 (2020).
9. Minh, B. Q., Dang, C. C., Vinh, L. S. & Lanfear, R. QMaker: fast and accurate method to estimate empirical models of protein evolution. Syst. Biol. 70, 1046–1060 (2021).
10. Yang, Z. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol. 39, 306–314 (1994).
11. Kalyaanamoorthy, S., Minh, B. Q., Wong, T. K. F., von Haeseler, A. & Jermiin, L. S. ModelFinder: fast model selection for accurate phylogenetic estimates. Nat. Methods 14, 587–589 (2017).
12. Holmes, I. A model of indel evolution by finite-state, continuous-time machines. Genetics 216, 1187–1204 (2020).
13. Yeang, C.-H. & Haussler, D. Detecting coevolution in and among protein domains. PLoS Comput. Biol. 3, 1–13 (2007).
14. Felsenstein, J. Maximum likelihood and minimum-steps methods for estimating evolutionary trees from data on discrete characters. Syst. Biol. 22, 240–249 (1973).
15. Siepel, A. & Haussler, D. Phylogenetic estimation of context-dependent substitution rates by maximum likelihood. Mol. Biol. Evol. 21, 468–488 (2004).
16. Klosterman, P. S. et al. XRATE: a fast prototyping, training and annotation tool for phylo-grammars. BMC Bioinform. 7, 428 (2006).
17. Varin, C., Reid, N. & Firth, D. An overview of composite likelihood methods. Stat. Sin. 21, 5–42 (2011).
18. Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, 8026–8037 (NeurIPS, 2017).
19. Yang, J. et al. Improved protein structure prediction using predicted interresidue orientations. Proc. Natl Acad. Sci. USA 117, 1496–1503 (2020).
20. Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree 2: approximately maximum-likelihood trees for large alignments. PLoS ONE 5, e9490 (2010).
21. Franzosa, E. A. & Xia, Y. Structural determinants of protein evolution are context-sensitive at the residue level. Mol. Biol. Evol. 26, 2387–2395 (2009).
22. Echave, J., Spielman, S. J. & Wilke, C. O. Causes of evolutionary rate variation among protein sites. Nat. Rev. Genet. 17, 109–121 (2016).
23. Dang, C., Vinh, L., Gascuel, O., Hazes, B. & Le, Q. FastMG: a simple, fast, and accurate maximum likelihood procedure to estimate amino acid replacement rate matrices from large data sets. BMC Bioinform. 15, 341 (2014).
24. Canh, N. D., Cao Dang, C., Vinh, L. S., Quang Minh, B. & Hoang, D. T. pQMaker: empirically estimating amino acid substitution models in a parallel environment. In 2020 12th International Conference on Knowledge and Systems Engineering (KSE), 324–329 (2020).
25. Jumper, J. M. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
26. Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR) (2015).
27. Bader, P., Blanes, S. & Casas, F. Computing the matrix exponential with an optimized Taylor polynomial approximation. Mathematics 7, 1174 (2019).
28. Ran, J., Shen, T.-T., Wang, M.-M. & Wang, X.-Q. Phylogenomics resolves the deep phylogeny of seed plants and indicates partial convergent or homoplastic evolution between Gnetales and angiosperms. Proc. R. Soc. B Biol. Sci. 285, 20181012 (2018).

Acknowledgements

S.P. acknowledges A. Umfurer for many helpful discussions on software design. We acknowledge the reviewers for their helpful feedback. We thank O. Gascuel for suggesting that we pair up more sequences beyond the original cherries in the trees, which empirically boosted the statistical efficiency of the method by around 10–30% (estimated). We also thank I. Holmes, J. Huelsenbeck, N. Thomas and W. DeWitt for helpful discussions. This research is supported in part by a National Institutes of Health grant R35-GM134922 (to Y.S.S.). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

Author information

Authors and Affiliations

Computer Science Division, University of California, Berkeley, CA, USA
Sebastian Prillo, Pierre Boyeau, Xingyu Li, Po-Yen Chen & Yun S. Song
Graduate Group in Computational Biology, University of California, Berkeley, CA, USA
Yun Deng
Department of Statistics, University of California, Berkeley, CA, USA
Yun S. Song

Contributions

S.P. and Y.S.S. conceived and designed the study. S.P. developed, implemented and tested the method, with assistance from Y.D., P.B., X.L. and P.-Y.C. P.B. contributed to implementing optimization with PyTorch, while X.L. and P.-Y.C. contributed to parallelizing and optimizing the computation of count matrices and the simulations. Y.D. helped with testing the method. S.P. and Y.S.S. analyzed the coevolution model. S.P. and Y.S.S. wrote the manuscript and Y.D. made edits. Y.S.S. supervised the project.

Corresponding author

Correspondence to Yun S. Song.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks Olivier Gascuel, Alexandros Stamatakis and the other, anonymous, reviewer for their contribution to the peer review of this work. Primary Handling Editor: Arunima Singh, in collaboration with the Nature Methods team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Plot of true versus estimated rate matrix entries for Fig. 1b,c.

For a select number of families (the multiples of 4), we plot the true versus estimated rate matrix entries. MRE stands for median relative error; ρ is Spearman’s rank correlation; r is Pearson correlation. For reference, we also indicate the total number of sequences, sites and residues in each dataset. As more data become available, estimation accuracy increases for both methods. Importantly, the loss of statistical efficiency of CherryML with respect to EM is relatively small (an estimated ≈ 50%, as seen in Fig. 1c). Interestingly, with small dataset sizes, the smallest transition rates tend to be underestimated by both methods, possibly because no (or too few) transitions between these states are observed.
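
The three summary statistics named in this legend can be computed as in the sketch below. It assumes that only off-diagonal entries are compared and that relative error means |estimate − truth| / truth; neither detail is stated in this section, and the helper names and the toy 3 × 3 matrices are hypothetical.

```python
import math
from statistics import median

def off_diagonal(m):
    """Flatten the off-diagonal entries of a square matrix, row by row."""
    n = len(m)
    return [m[i][j] for i in range(n) for j in range(n) if i != j]

def median_relative_error(true, est):
    # Assumed definition: median of |estimate - truth| / truth.
    return median(abs(e - t) / t for t, e in zip(true, est))

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = shared
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Toy matrices (not taken from the paper).
true_q = [[-3.0, 1.0, 2.0], [4.0, -9.0, 5.0], [0.5, 2.5, -3.0]]
est_q = [[-3.3, 1.1, 2.2], [4.4, -9.9, 5.5], [0.4, 2.75, -3.15]]
t_rates, e_rates = off_diagonal(true_q), off_diagonal(est_q)
mre = median_relative_error(t_rates, e_rates)
rho = spearman(t_rates, e_rates)
r = pearson(t_rates, e_rates)
print(mre, rho, r)
```

In the toy data the smallest rate (0.5) is the one estimated worst, which affects r more than ρ because the rank order is unchanged.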

Extended Data Fig. 2 Plot of rate matrix estimates from Fig. 1e.

We see that the entries of the LG rate matrix⁵ and re-estimates with CherryML and EM are quite similar. The only noticeable differences correspond to four of the rarest (and thus harder to estimate) rates, which are between C and E, and C and K (in both directions). It is possible that CherryML is underestimating these rates, but since there is no ground truth and the model may be misspecified (as these are real data estimates), we cannot say with certainty; bootstrap estimates of variance could partially answer this question. For reference, the dataset size in terms of number of families, sequences, sites and residues is: 3,412, ≈ 50,000, ≈ 600,000 and ≈ 6.5M respectively. In principle, this is roughly comparable in size to 256 families in the (well-specified) simulations from Fig. 1b,c, where both EM and CherryML are accurate even for small rates, as seen in Extended Data Fig. 1. However, these direct comparisons of dataset size might overestimate the information content of real datasets. Indeed, it is possible that the effective amount of information for these small rates is more comparable to 16 families in the simulations from Fig. 1b,c, where CherryML produces more underestimates than does EM.

Extended Data Fig. 3 CherryML matches EM accuracy on diverse datasets.

On diverse datasets from the QMaker paper⁹, CherryML matches the accuracy of the EM method. The end-to-end runtime of each approach (including tree estimation) is shown. The runtime of the CherryML optimizer was in all cases negligible (less than 5 minutes), so end-to-end runtime was dominated by phylogeny reconstruction with FastTree, which took a few CPU hours depending on the dataset. In contrast, for the EM approach, the EM optimizer dominated runtime, leading to an overall slowdown of 5–20-fold in end-to-end runtime compared to the CherryML approach. Since tree estimation is embarrassingly parallel, end-to-end estimation with the CherryML method using 32 CPU cores takes only a few minutes on all of these datasets. The diversity of the datasets means that LG is no longer the best-fitting rate matrix compared to JTT and WAG. In fact, JTT is preferred in three of these datasets. This highlights the need to estimate new rate matrices for improved phylogenetic inference in specific applications⁹. Training dataset sizes are included for reference.
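
Because each family's tree is estimated independently of the others, this step can be spread across cores with a plain parallel map. The sketch below only illustrates that pattern: `estimate_tree` is a hypothetical stand-in that returns a dummy Newick string, whereas a real pipeline would launch the tree builder there (for example as an external process). Threads are assumed to suffice in that case, since the heavy work would run in the child processes.

```python
from concurrent.futures import ThreadPoolExecutor

def estimate_tree(family_id):
    """Hypothetical placeholder for one per-family tree estimation job."""
    # A real job would run the tree builder on this family's alignment.
    return f"({family_id}_a:0.1,{family_id}_b:0.1);"

def estimate_all_trees(family_ids, workers=32):
    """Run the independent per-family jobs concurrently, keeping input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(family_ids, pool.map(estimate_tree, family_ids)))

trees = estimate_all_trees(["fam1", "fam2", "fam3"], workers=3)
print(trees["fam2"])
```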

Extended Data Fig. 4 Plot of rate matrix estimates from Extended Data Fig. 3.

Similarly to Extended Data Fig. 2, CherryML and EM agree on most rates, except for some of the smallest (harder to estimate) rates, where CherryML usually reports smaller rates. It is possible that these are underestimates from CherryML, for instance if the information content for these small rates is similar to 16 families in Extended Data Fig. 1, where CherryML produces more underestimates compared to EM.

Extended Data Fig. 5 Comparison of mutation rates.

(a) Our 400 × 400 coevolution model. (b) Independent-sites model.

Extended Data Fig. 6 Comparison of stationary distributions.

(a) Our 400 × 400 coevolution model. (b) Independent-sites model.
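
For orientation, one standard way to write an independent-sites model on pairs of sites is the Kronecker sum Q ⊗ I + I ⊗ Q of a single-site rate matrix Q: only one site changes at a time, and the stationary distribution is the product of the single-site distributions. Whether the paper's independent-sites baseline is built exactly this way is an assumption here, and the function names and the two-state toy matrix are hypothetical. With a 20 × 20 amino acid matrix the same construction yields a 400 × 400 matrix.

```python
def kronecker_sum(q):
    """Joint rate matrix for two sites evolving independently under q.

    The pair state (a, b) is mapped to index a * n + b.
    """
    n = len(q)
    joint = [[0.0] * (n * n) for _ in range(n * n)]
    for a in range(n):
        for b in range(n):
            row = a * n + b
            for c in range(n):
                joint[row][c * n + b] += q[a][c]  # first site moves a -> c
                joint[row][a * n + c] += q[b][c]  # second site moves b -> c
    return joint

def product_stationary(pi):
    """Stationary distribution over pair states under independence."""
    return [pa * pb for pa in pi for pb in pi]

# Toy two-state example; pi satisfies pi @ q = 0.
q = [[-1.0, 1.0], [2.0, -2.0]]
pi = [2 / 3, 1 / 3]
joint = kronecker_sum(q)
pi_joint = product_stationary(pi)
for row in joint:
    print(row)
```

Entries for simultaneous changes at both sites, such as (0, 0) → (1, 1), are zero under this construction; a general coevolution rate matrix is free to make them nonzero.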

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Prillo, S., Deng, Y., Boyeau, P. et al. CherryML: scalable maximum likelihood estimation of phylogenetic models. Nat Methods 20, 1232–1236 (2023). https://doi.org/10.1038/s41592-023-01917-9

Received: 21 September 2022
Accepted: 18 May 2023
Published: 29 June 2023
Issue Date: August 2023
DOI: https://doi.org/10.1038/s41592-023-01917-9