Abstract
Phylogenetic models of molecular evolution are central to numerous biological applications spanning diverse timescales, from hundreds of millions of years involving orthologous proteins to just tens of days relating to single cells within an organism. A fundamental problem in these applications is estimating model parameters, for which maximum likelihood estimation is typically employed. Unfortunately, maximum likelihood estimation is a computationally expensive task, in some cases prohibitively so. To address this challenge, we here introduce CherryML, a broadly applicable method that achieves several orders of magnitude speedup by using a quantized composite likelihood over cherries in the trees. The massive speedup offered by our method should enable researchers to consider more complex and biologically realistic models than previously possible. Here we demonstrate CherryML’s utility by applying it to estimate a general 400 × 400 rate matrix for residue–residue coevolution at contact sites in three-dimensional protein structures; we estimate that using current state-of-the-art methods such as the expectation-maximization algorithm for the same task would take >100,000 times longer.
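The phrase "quantized composite likelihood over cherries" can be illustrated with a small sketch. The code below is not the CherryML package's implementation; it is a toy version under stated assumptions: cherries are given as gap-free aligned sequence pairs with a known separating time, times are snapped to a hypothetical log-spaced grid, transitions are counted in both directions (appropriate only for a time-reversible model), and the matrix exponential uses a plain scaling-and-squaring Taylor series. All function names are illustrative.

```python
import math
from collections import defaultdict


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(q, t, terms=20, squarings=10):
    """Approximate exp(t * q) by scaling and squaring a truncated Taylor series."""
    n = len(q)
    scale = t / 2 ** squarings
    a = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    result = [row[:] for row in identity]
    term = [row[:] for row in identity]
    for k in range(1, terms + 1):
        term = [[v / k for v in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def quantize(t, grid):
    """Snap a time to the nearest grid point on a log scale."""
    return min(grid, key=lambda g: abs(math.log(g) - math.log(t)))


def count_transitions(cherries, states, grid):
    """Accumulate one state-by-state count matrix per quantized time."""
    index = {s: i for i, s in enumerate(states)}
    n = len(states)
    counts = defaultdict(lambda: [[0] * n for _ in range(n)])
    for seq_x, seq_y, t in cherries:
        bucket = quantize(t, grid)
        for x, y in zip(seq_x, seq_y):
            # Assumption: reversible model, so count both directions.
            counts[bucket][index[x]][index[y]] += 1
            counts[bucket][index[y]][index[x]] += 1
    return counts


def composite_log_likelihood(q, counts):
    """Sum of count * log transition probability over all time buckets."""
    total = 0.0
    for t, c in counts.items():
        p = mat_exp(q, t)  # one matrix exponential per bucket, not per cherry
        for i, row in enumerate(c):
            for j, c_ij in enumerate(row):
                if c_ij:
                    total += c_ij * math.log(p[i][j])
    return total


# Toy two-state example with a symmetric rate matrix.
q = [[-1.0, 1.0], [1.0, -1.0]]
cherries = [("AB", "AB", 0.8), ("AA", "AB", 1.3)]
counts = count_transitions(cherries, "AB", grid=[0.1, 1.0, 10.0])
print(dict(counts), composite_log_likelihood(q, counts))
```

In this sketch the cost of evaluating the objective depends on the number of grid points rather than on the number of cherries, because the data enter only through the per-bucket count matrices.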
Data availability
The LG paper (ref. 5) training and testing Pfam datasets, consisting of 3,912 and 500 families, respectively, are available at http://www.atgc-montpellier.fr/models/index.php?model=lg. The Pfam dataset with structure data from Yang et al. (ref. 19), consisting of 15,051 families, is located at https://files.ipd.uw.edu/pub/trRosetta/training_set.tar.gz. The QMaker (ref. 9) datasets are available at https://figshare.com/articles/dataset/QMaker-datasets_zip/9768101. Our simulated datasets used for Figs. 1b–d and 2a,b are available on Zenodo at https://zenodo.org/record/7830072#.ZDnPBuzMKTc. Instructions for how to reproduce all results in this paper using the above datasets can be found at https://github.com/songlab-cal/CherryML.
Code availability
Code for reproducing all results in this paper, as well as code implementing the CherryML method for the LG model and for the coevolution model, is available on GitHub at https://github.com/songlab-cal/CherryML. The CherryML package allows seamless estimation of rate matrices from MSAs. An end-to-end demonstration on the plant dataset (ref. 28) with train/test splits from the QMaker work (ref. 9) is provided in the package’s README. A Code Ocean capsule of the CherryML package is provided at https://codeocean.com/capsule/1152557/tree.
References
1. Dayhoff, M. O. & Schwartz, R. M. A model of evolutionary changes in protein. In Atlas of Protein Sequence and Structure, Ch. 22, 345–352 (National Biomedical Research Foundation, 1978).
2. Jones, D. T., Taylor, W. R. & Thornton, J. M. The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci. 8, 275–282 (1992).
3. Whelan, S. & Goldman, N. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol. Biol. Evol. 18, 691–699 (2001).
4. Yang, Z. PAML 4: phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. 24, 1586–1591 (2007).
5. Le, S. Q. & Gascuel, O. An improved general amino acid replacement matrix. Mol. Biol. Evol. 25, 1307–1320 (2008).
6. Guindon, S. et al. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst. Biol. 59, 307–321 (2010).
7. Bouckaert, R. et al. BEAST 2.5: an advanced software platform for Bayesian evolutionary analysis. PLoS Comput. Biol. 15, e1006650 (2019).
8. Minh, B. Q. et al. IQ-TREE 2: new models and efficient methods for phylogenetic inference in the genomic era. Mol. Biol. Evol. 37, 1530–1534 (2020).
9. Minh, B. Q., Dang, C. C., Vinh, L. S. & Lanfear, R. QMaker: fast and accurate method to estimate empirical models of protein evolution. Syst. Biol. 70, 1046–1060 (2021).
10. Yang, Z. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol. 39, 306–314 (1994).
11. Kalyaanamoorthy, S., Minh, B. Q., Wong, T. K. F., von Haeseler, A. & Jermiin, L. S. ModelFinder: fast model selection for accurate phylogenetic estimates. Nat. Methods 14, 587–589 (2017).
12. Holmes, I. A model of indel evolution by finite-state, continuous-time machines. Genetics 216, 1187–1204 (2020).
13. Yeang, C.-H. & Haussler, D. Detecting coevolution in and among protein domains. PLoS Comput. Biol. 3, 1–13 (2007).
14. Felsenstein, J. Maximum likelihood and minimum-steps methods for estimating evolutionary trees from data on discrete characters. Syst. Biol. 22, 240–249 (1973).
15. Siepel, A. & Haussler, D. Phylogenetic estimation of context-dependent substitution rates by maximum likelihood. Mol. Biol. Evol. 21, 468–488 (2004).
16. Klosterman, P. S. et al. XRATE: a fast prototyping, training and annotation tool for phylo-grammars. BMC Bioinform. 7, 428 (2006).
17. Varin, C., Reid, N. & Firth, D. An overview of composite likelihood methods. Stat. Sin. 21, 5–42 (2011).
18. Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, 8026–8037 (NeurIPS, 2019).
19. Yang, J. et al. Improved protein structure prediction using predicted interresidue orientations. Proc. Natl Acad. Sci. USA 117, 1496–1503 (2020).
20. Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree 2: approximately maximum-likelihood trees for large alignments. PLoS ONE 5, e9490 (2010).
21. Franzosa, E. A. & Xia, Y. Structural determinants of protein evolution are context-sensitive at the residue level. Mol. Biol. Evol. 26, 2387–2395 (2009).
22. Echave, J., Spielman, S. J. & Wilke, C. O. Causes of evolutionary rate variation among protein sites. Nat. Rev. Genet. 17, 109–121 (2016).
23. Dang, C., Vinh, L., Gascuel, O., Hazes, B. & Le, Q. FastMG: a simple, fast, and accurate maximum likelihood procedure to estimate amino acid replacement rate matrices from large data sets. BMC Bioinform. 15, 341 (2014).
24. Canh, N. D., Cao Dang, C., Vinh, L. S., Quang Minh, B. & Hoang, D. T. pQMaker: empirically estimating amino acid substitution models in a parallel environment. In 2020 12th International Conference on Knowledge and Systems Engineering (KSE), 324–329 (2020).
25. Jumper, J. M. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
26. Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR) (2015).
27. Bader, P., Blanes, S. & Casas, F. Computing the matrix exponential with an optimized Taylor polynomial approximation. Mathematics 7, 1174 (2019).
28. Ran, J., Shen, T.-T., Wang, M.-M. & Wang, X.-Q. Phylogenomics resolves the deep phylogeny of seed plants and indicates partial convergent or homoplastic evolution between Gnetales and angiosperms. Proc. R. Soc. B Biol. Sci. 285, 20181012 (2018).
Acknowledgements
S.P. acknowledges A. Umfurer for many helpful discussions on software design. We acknowledge the reviewers for their helpful feedback. We thank O. Gascuel for suggesting that we pair up more sequences beyond the original cherries in the trees, which empirically boosted the statistical efficiency of the method by around 10–30% (estimated). We also thank I. Holmes, J. Huelsenbeck, N. Thomas and W. DeWitt for helpful discussions. This research is supported in part by a National Institutes of Health grant R35-GM134922 (to Y.S.S.). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.
Author information
Contributions
S.P. and Y.S.S. conceived and designed the study. S.P. developed, implemented and tested the method, with assistance from Y.D., P.B., X.L. and P.-Y.C. P.B. contributed to implementing optimization with PyTorch, while X.L. and P.-Y.C. contributed to parallelizing and optimizing the computation of count matrices and the simulations. Y.D. helped with testing the method. S.P. and Y.S.S. analyzed the coevolution model. S.P. and Y.S.S. wrote the manuscript and Y.D. made edits. Y.S.S. supervised the project.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Methods thanks Olivier Gascuel, Alexandros Stamatakis and the other, anonymous, reviewer for their contribution to the peer review of this work. Primary Handling Editor: Arunima Singh, in collaboration with the Nature Methods team.
Extended data
Extended Data Fig. 1 Plot of true versus estimated rate matrix entries for Fig. 1b,c.
For a select number of families (the multiples of 4), we plot the true versus estimated rate matrix entries. MRE stands for median relative error; ρ is Spearman’s rank correlation; r is Pearson correlation. For reference, we also indicate the total number of sequences, sites and residues in each dataset. As more data become available, estimation accuracy increases for both methods. Importantly, the loss of statistical efficiency of CherryML with respect to EM is relatively small (an estimated ≈50%, as seen in Fig. 1c). Interestingly, with small dataset sizes, the smallest transition rates tend to be underestimated by both methods, possibly because no transitions (or too few) between these states are observed.
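The three summary statistics named in this caption can be computed as in the sketch below. The exact definition of MRE used for the figure is not given here, so the code assumes the median of |estimated − true| / true over the plotted entries; the function names are illustrative and ties in Spearman's ρ are handled with average ranks.

```python
import math
from statistics import median


def median_relative_error(true, est):
    """Median of |est - true| / true (assumed definition of MRE)."""
    return median(abs(e - t) / t for t, e in zip(true, est))


def pearson(x, y):
    """Pearson correlation coefficient r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical off-diagonal rates, for illustration only.
true_rates = [1.0, 2.0, 3.0, 4.0]
est_rates = [1.1, 1.8, 3.3, 4.0]
print(median_relative_error(true_rates, est_rates),
      spearman(true_rates, est_rates),
      pearson(true_rates, est_rates))
```

Because MRE divides by the true rate, it is sensitive to errors in the smallest rates, which is where the caption reports underestimation at small dataset sizes.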
Extended Data Fig. 2 Plot of rate matrix estimates from Fig. 1e.
We see that the entries of the LG rate matrix (ref. 5) and re-estimates with CherryML and EM are quite similar. The only noticeable differences correspond to four of the rarest (and thus harder to estimate) rates, which are between C and E, and C and K (in both directions). It is possible that CherryML is underestimating these rates, but since there is no ground truth and the model may be misspecified (as these are real data estimates), we cannot say with certainty; bootstrap estimates of variance could partially answer this question. For reference, the dataset size in terms of number of families, sequences, sites and residues is 3,912, ≈50,000, ≈600,000 and ≈6.5M, respectively. In principle, this is roughly comparable in size to 256 families in the (well-specified) simulations from Fig. 1b,c, where both EM and CherryML are accurate even for small rates, as seen in Extended Data Fig. 1. However, these direct comparisons of dataset size might overestimate the information content of real datasets. Indeed, it is possible that the effective amount of information for these small rates is more comparable to 16 families in the simulations from Fig. 1b,c, where CherryML produces more underestimates than does EM.
Extended Data Fig. 3 CherryML matches EM accuracy on diverse datasets.
On diverse datasets from the QMaker paper (ref. 9), CherryML matches the accuracy of the EM method. The end-to-end runtime of each approach (including tree estimation) is shown. The runtime of the CherryML optimizer was negligible in all cases (less than 5 minutes), so end-to-end runtime was dominated by phylogeny reconstruction with FastTree, which took a few CPU hours depending on the dataset. In contrast, for the EM approach, the EM optimizer dominated runtime, leading to an overall slowdown of 5–20-fold in end-to-end runtime compared to the CherryML approach. Since tree estimation is embarrassingly parallel, end-to-end estimation with the CherryML method using 32 CPU cores takes only a few minutes on all of these datasets. The diversity of the datasets means that LG is no longer the best-fit rate matrix compared to JTT and WAG. In fact, JTT is preferred in three of these datasets. This highlights the need to estimate new rate matrices for improved phylogenetic inference in specific applications (ref. 9). Training dataset sizes are included for reference.
Extended Data Fig. 4 Plot of rate matrix estimates from Extended Data Fig. 3.
Similarly to Extended Data Fig. 2, CherryML and EM agree on most rates, except for some of the smallest (harder to estimate) rates, where CherryML usually reports smaller rates. It is possible that these are underestimates from CherryML, for instance if the information content for these small rates is similar to 16 families in Extended Data Fig. 1, where CherryML produces more underestimates compared to EM.
Extended Data Fig. 5 Comparison of mutation rates.
(a) Our 400 × 400 coevolution model. (b) Independent-sites model.
Extended Data Fig. 6 Comparison of stationary distributions.
(a) Our 400 × 400 coevolution model. (b) Independent-sites model.
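For context on the comparison in Extended Data Figs. 5 and 6, one standard way to express an independent-sites model on pair states is sketched below. Whether the figures' baseline was built exactly this way is an assumption. Given a single-site rate matrix Q with stationary distribution π, two sites evolving independently have pair rate matrix Q ⊗ I + I ⊗ Q (only one site changes at a time) and stationary distribution π ⊗ π. A 20 × 20 single-site matrix yields a 400 × 400 pair matrix; the toy example uses two states so that the result is easy to inspect, and all names are illustrative.

```python
def independent_pair_rate_matrix(q):
    """Pair-state rate matrix for two sites evolving independently under q.

    Pair state (a, b) is stored at index a * n + b.
    """
    n = len(q)
    big = [[0.0] * (n * n) for _ in range(n * n)]
    for a in range(n):
        for b in range(n):
            for c in range(n):
                for d in range(n):
                    if a == c and b == d:
                        rate = q[a][a] + q[b][b]  # diagonal entry
                    elif b == d:
                        rate = q[a][c]            # only the first site changes
                    elif a == c:
                        rate = q[b][d]            # only the second site changes
                    else:
                        rate = 0.0                # no simultaneous changes
                    big[a * n + b][c * n + d] = rate
    return big


def product_distribution(pi):
    """Stationary distribution of the independent pair model: pi (x) pi."""
    return [pa * pb for pa in pi for pb in pi]


def left_multiply(vec, mat):
    """Row vector times matrix."""
    return [sum(vec[i] * mat[i][j] for i in range(len(vec)))
            for j in range(len(mat[0]))]


# Toy two-state single-site model; pi = (2/3, 1/3) satisfies pi Q = 0.
q = [[-1.0, 1.0], [2.0, -2.0]]
pi = [2.0 / 3.0, 1.0 / 3.0]
pair_q = independent_pair_rate_matrix(q)
pair_pi = product_distribution(pi)
print(pair_q)
print(left_multiply(pair_pi, pair_q))  # approximately all zeros
```

A general 400 × 400 model has no such constraint: it may assign nonzero rates to simultaneous changes at both sites, and its stationary distribution need not factorize as a product.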
About this article
Cite this article
Prillo, S., Deng, Y., Boyeau, P. et al. CherryML: scalable maximum likelihood estimation of phylogenetic models. Nat Methods 20, 1232–1236 (2023). https://doi.org/10.1038/s41592-023-01917-9
DOI: https://doi.org/10.1038/s41592-023-01917-9