Abstract
Felsenstein’s bootstrap approach is widely used to assess confidence in species relationships inferred from multiple sequence alignments. It resamples sites randomly with replacement to build alignment replicates of the same size as the original alignment and infers a phylogeny from each replicate dataset. The proportion of phylogenies recovering the same grouping of species is its bootstrap confidence limit. However, standard bootstrap imposes a high computational burden in applications involving long sequence alignments. Here, we introduce the bag of little bootstraps approach to phylogenetics, bootstrapping only a few little samples, each containing a small subset of sites. We report that the median-bagging of bootstrap confidence limits from little samples produces confidence in inferred species relationships similar to standard bootstrap but in a fraction of the computational time and memory. Therefore, the little bootstraps approach can potentially enhance the rigor, efficiency and parallelization of big data phylogenomic analyses.
Data availability
All simulated DNA sequence alignments containing 446 taxa were obtained from published research articles (refs. 23,24). Ten empirical datasets from a variety of species were analyzed. These DNA sequence alignments consisted of sequences from Eutherian mammals (ref. 14), butterflies (ref. 7), plants (A, ref. 6; B, ref. 10), insects (A, ref. 11; B, ref. 12; C, ref. 5), spiders (A, ref. 9; B, ref. 8) and birds (ref. 13). All empirical and simulated datasets analyzed in this paper are available in an online repository (ref. 28). Source data are provided with this paper.
Code availability
R code is available from https://github.com/ssharma2712/Little-Bootstraps. A capsule containing the source code and datasets for our analyses is available on the CodeOcean service (ref. 29). Users can replicate the little bootstraps sampling and bagging steps in this capsule.
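As a rough illustration of the bagging step, the following Python sketch computes a bootstrap confidence limit (BCL) for one species group within each little sample and then takes the median across little samples. It is not the authors' R code: the function names, the representation of a tree as a set of clades, and the toy data are all assumptions made for this example.

```python
from statistics import median


def bcl(clade, replicate_trees):
    """Proportion of replicate trees that contain the given clade."""
    hits = sum(1 for tree in replicate_trees if clade in tree)
    return hits / len(replicate_trees)


def median_bagged_bcl(clade, little_samples):
    """Median of the per-little-sample BCLs for one clade."""
    return median(bcl(clade, trees) for trees in little_samples)


# Toy data (hypothetical): each tree is a set of clades, each clade a frozenset of taxa.
AB = frozenset({"A", "B"})
CD = frozenset({"C", "D"})
AC = frozenset({"A", "C"})
tree_with_ab = {AB, CD}
tree_without_ab = {AC}

# Three little samples, four bootstrap replicate trees each.
little_samples = [
    [tree_with_ab] * 4,                          # BCL = 1.00
    [tree_with_ab] * 3 + [tree_without_ab],      # BCL = 0.75
    [tree_with_ab] * 2 + [tree_without_ab] * 2,  # BCL = 0.50
]

bagged = median_bagged_bcl(AB, little_samples)
print(bagged)
```

Taking the median rather than the mean makes the bagged value less sensitive to a single little sample that yields an unusually high or low BCL.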
References
1. Felsenstein, J. Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39, 783–791 (1985).
2. Kumar, S. & Filipski, A. Multiple sequence alignment: in pursuit of homologous DNA positions. Genome Res. 17, 127–135 (2007).
3. Kumar, S., Filipski, A. J., Battistuzzi, F. U., Kosakovsky Pond, S. L. & Tamura, K. Statistics and truth in phylogenomics. Mol. Biol. Evol. 29, 457–472 (2012).
4. Kapli, P., Yang, Z. & Telford, M. J. Phylogenetic tree building in the genomic age. Nat. Rev. Genet. 21, 428–444 (2020).
5. Johnson, K. P. et al. Phylogenomics and the evolution of hemipteroid insects. Proc. Natl Acad. Sci. USA 115, 12775–12780 (2018).
6. Ran, J. H., Shen, T. T., Wu, H., Gong, X. & Wang, X. Q. Phylogeny and evolutionary history of Pinaceae updated by transcriptomic analysis. Mol. Phylogenet. Evol. 129, 106–116 (2018).
7. Allio, R. et al. Whole genome shotgun phylogenomics resolves the pattern and timing of swallowtail butterfly evolution. Syst. Biol. 69, 38–60 (2020).
8. Hedin, M., Derkarabetian, S., Alfaro, A., Ramírez, M. J. & Bond, J. E. Phylogenomic analysis and revised classification of atypoid mygalomorph spiders (Araneae, Mygalomorphae), with notes on arachnid ultraconserved element loci. PeerJ 7, e6864 (2019).
9. Kuntner, M. et al. Golden orbweavers ignore biological rules: phylogenomic and comparative analyses unravel a complex evolution of sexual size dimorphism. Syst. Biol. 68, 555–572 (2019).
10. Pessoa-Filho, M., Martins, A. M. & Ferreira, M. E. Molecular dating of phylogenetic divergence between Urochloa species based on complete chloroplast genomes. BMC Genomics 18, 516 (2017).
11. Peters, R. S. et al. Evolutionary history of the Hymenoptera. Curr. Biol. 27, 1013–1018 (2017).
12. Peters, R. S. et al. Transcriptome sequence-based phylogeny of chalcidoid wasps (Hymenoptera: Chalcidoidea) reveals a history of rapid radiations, convergence and evolutionary success. Mol. Phylogenet. Evol. 120, 286–296 (2018).
13. Yonezawa, T. et al. Phylogenomics and morphology of extinct paleognaths reveal the origin and evolution of the ratites. Curr. Biol. 27, 68–77 (2017).
14. Song, S., Liu, L., Edwards, S. V. & Wu, S. Resolving conflict in Eutherian mammal phylogeny using phylogenomics and the multispecies coalescent model. Proc. Natl Acad. Sci. USA 109, 14942–14947 (2012).
15. Stamatakis, A., Hoover, P. & Rougemont, J. A rapid bootstrap algorithm for the RAxML web servers. Syst. Biol. 57, 758–771 (2008).
16. Minh, B. Q., Nguyen, M. A. T. & Von Haeseler, A. Ultrafast approximation for phylogenetic bootstrap. Mol. Biol. Evol. 30, 1188–1195 (2013).
17. Kleiner, A., Talwalkar, A., Sarkar, P. & Jordan, M. I. A scalable bootstrap for massive data. J. R. Stat. Soc. B Stat. Methodol. 76, 795–816 (2014).
18. Seo, T.-K. Calculating bootstrap probabilities of phylogeny using multilocus sequence data. Mol. Biol. Evol. 25, 960–971 (2008).
19. Pattengale, N. D., Alipour, M., Bininda-Emonds, O. R. P., Moret, B. M. E. & Stamatakis, A. How many bootstrap replicates are necessary? J. Comput. Biol. 17, 337–354 (2010).
20. Leys, C., Ley, C., Klein, O., Bernard, P. & Licata, L. Detecting outliers: do not use standard deviation around the mean, use absolute deviation around the median. J. Exp. Soc. Psychol. 49, 764–766 (2013).
21. Nguyen, L. T., Schmidt, H. A., Von Haeseler, A. & Minh, B. Q. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol. 32, 268–274 (2015).
22. Lemoine, F. et al. Renewing Felsenstein’s phylogenetic bootstrap in the era of big data. Nature 556, 452–456 (2018).
23. Rosenberg, M. S. & Kumar, S. Heterogeneity of nucleotide frequencies among evolutionary lineages and phylogenetic inference. Mol. Biol. Evol. 20, 610–621 (2003).
24. Tamura, K. et al. Estimating divergence times in large molecular phylogenies. Proc. Natl Acad. Sci. USA 109, 19333–19338 (2012).
25. R Core Team. R: a language and environment for statistical computing (R Foundation for Statistical Computing, 2020).
26. Pagès, H., Aboyoun, P., Gentleman, R. & DebRoy, S. Biostrings: efficient manipulation of biological strings. R Package Version 2.46.0 (Bioconductor, 2017); https://doi.org/10.18129/B9.bioc.Biostrings
27. Schliep, K. P. phangorn: phylogenetic analysis in R. Bioinformatics 27, 592–593 (2011).
28. Sharma, S. & Kumar, S. Fast and accurate bootstrap confidence limits on genome-scale phylogenies using little bootstraps. figshare https://doi.org/10.6084/m9.figshare.14130494
29. Sharma, S. & Kumar, S. Fast and accurate bootstrap confidence limits on genome-scale phylogenies using little bootstraps. CodeOcean https://doi.org/10.24433/CO.6432188.v1
30. Efron, B. Estimating the error rate of a prediction rule: improvement on cross-validation. J. Am. Stat. Assoc. 78, 316–331 (1983).
Acknowledgements
We thank S. Vahdatshoar and J. Davis for their help with computational analysis. We thank J. Craig, Q. Tao, M. Caraballo-Ortiz, A. Chroni, C. Palacios, S. L. K. Pond and S. Blair Hedges for providing critical comments on the manuscript. This research was supported by a grant from the US National Institutes of Health to S.K. (GM139540-01).
Author information
Contributions
S.K. initially conceived all the methods, designed many analyses, developed visualizations and wrote the manuscript. S.S. refined methods, designed and conducted analyses, refined visualizations and contributed to writing the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Peer review information: Nature Computational Science thanks Alexandros Stamatakis and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Handling editor: Ananya Rastogi, in collaboration with the Nature Computational Science team.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 A comparison of the standard and little bootstrap approaches.
Steps of (a) the standard phylogeny bootstrap and (b) the little bootstraps (BS) approach. Shaded boxes represent sequence alignments, with width representing sequence length. In standard BS, L sites are randomly sampled with replacement from the original dataset containing L sites. In this resampling process, ~63.2% of the data points (refs. 17,30) are expected to be represented in a bootstrap replicate dataset. Each replicate dataset is compressed into weighted resamples that contain only distinct site configurations and a vector of their counts (represented by stacks of dots). An ML tree is inferred from each replicate dataset, and the BCL for a species group is the proportion of bootstrap replicate phylogenies in which that group appears. In little BS, L sites are randomly sampled with replacement from the little dataset consisting of only \(l = L^{g}\) sites, which produces bootstrap replicate datasets. Because \(l \ll L\), each site will be represented many times in the little bootstraps replicate datasets; we refer to this as upsampling, which changes the frequency of unique site configurations. Because of upsampling, stacks of dots are much higher for little BS than for standard BS, which involves only resampling. The number of distinct site configurations in the upsampled dataset is smaller than in the standard bootstrap replicate dataset because \(l \ll L\).
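The two resampling schemes in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, sites are represented only by their column indices, and drawing the little dataset without replacement is an assumption. For the standard bootstrap, the expected fraction of sites represented is 1 − (1 − 1/L)^L, which approaches 1 − 1/e ≈ 0.632 for large L.

```python
import random
from collections import Counter


def standard_replicate(L, rng):
    """Draw L site indices with replacement from all L sites; return index -> count."""
    return Counter(rng.choices(range(L), k=L))


def little_replicate(L, g, rng):
    """Pick a little dataset of l = int(L**g) sites, then upsample it to L draws."""
    l = int(L ** g)
    # Assumption: the little dataset is a subset drawn without replacement.
    little_sites = rng.sample(range(L), l)
    # Upsampling: L draws with replacement from only l sites.
    return Counter(rng.choices(little_sites, k=L))


rng = random.Random(0)
L = 10_000
std = standard_replicate(L, rng)
little = little_replicate(L, 0.7, rng)

std_fraction = len(std) / L        # close to 0.632
little_fraction = len(little) / L  # at most l / L
print(f"standard: {std_fraction:.3f} of sites represented")
print(f"little:   {little_fraction:.3f} of sites represented")
print(f"mean count per represented site: {L / len(std):.1f} vs {L / len(little):.1f}")
```

Both replicates carry a total weight of L, but the little bootstrap replicate spreads that weight over far fewer distinct sites, which corresponds to the taller stacks of dots in panel (b).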
Extended Data Fig. 2 The number of sites used in little and standard bootstrap replicates.
The proportion of sites included in the little bootstrap replicates for little datasets with \(l = L^{0.7}\) (open circles) and in standard bootstrap replicates (closed circles). The choice of \(l = L^{0.7}\) offers increasingly greater computational savings for longer sequences because the proportion of sites included in the little samples decreases with L. For example, standard bootstrap replicates always contain approximately 63% (ref. 30) of the site configurations from the full datasets. In contrast, the little dataset size is ~3.1% of the original alignment for L = 100,000 bases and decreases to ~1.6% when L increases 10-fold (1,000,000 bases).
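The percentages in this caption follow directly from the power-law choice of little dataset size: the fraction of the alignment used is L^0.7 / L = L^(−0.3). A short Python check is given below; the function name is ours, introduced only for illustration.

```python
def little_fraction(L, g=0.7):
    """Fraction of the alignment in a little dataset of l = L**g sites."""
    return L ** g / L


for L in (100_000, 1_000_000):
    print(f"L = {L:>9,}: l = {L ** 0.7:>8,.0f} sites, l/L = {little_fraction(L):.2%}")
```

Evaluating this gives roughly 3% of the alignment for L = 100,000 and about 1.6% for L = 1,000,000, in line with the approximate values quoted in the caption.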
Extended Data Fig. 3 Patterns of unique site configurations per sequence and little sample size.
The relationship between the number of unique site configurations per sequence (C/S, log-transformed) and the selected little sample size (power factor, g) (\(R^{2} = 0.76\)).
Extended Data Fig. 4 Precision of little bootstrap confidence limits.
The relationship between little BS \(\widehat{BCL}\)s and their precision (standard errors) for the selected little BS parameters. The standard errors are inversely related to the little bootstrap confidence limits (\(R^{2} = 0.59\)).
Source data
Source Data Fig. 1
Phylogenetic trees and analysis log files for Fig. 1.
Source Data Extended Data Fig. 2
Source codes (R-script) that produce source data for Extended Data Fig. 2.
Source Data Extended Data Fig. 3
Statistical source data for Extended Data Fig. 3.
Source Data Extended Data Fig. 4
Phylogenetic tree files for Extended Data Fig. 4.
Cite this article
Sharma, S., Kumar, S. Fast and accurate bootstrap confidence limits on genome-scale phylogenies using little bootstraps. Nat Comput Sci 1, 573–577 (2021). https://doi.org/10.1038/s43588-021-00129-5
DOI: https://doi.org/10.1038/s43588-021-00129-5