Renewing Felsenstein’s phylogenetic bootstrap in the era of big data



Felsenstein’s application of the bootstrap method to evolutionary trees is one of the most cited scientific papers of all time. The bootstrap method, which is based on resampling and replications, is used extensively to assess the robustness of phylogenetic inferences. However, increasing numbers of sequences are now available for a wide variety of species, and phylogenies based on hundreds or thousands of taxa are becoming routine. With phylogenies of this size Felsenstein’s bootstrap tends to yield very low supports, especially on deep branches. Here we propose a new version of the phylogenetic bootstrap in which the presence of inferred branches in replications is measured using a gradual ‘transfer’ distance rather than the binary presence or absence index used in Felsenstein’s original version. The resulting supports are higher and do not induce falsely supported branches. The application of our method to large mammal, HIV and simulated datasets reveals their phylogenetic signals, whereas Felsenstein’s bootstrap fails to do so.
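To make the contrast between the binary index and the gradual transfer distance concrete, the following Python fragment is a minimal sketch, not the authors' implementation. It assumes that each tree is given as a list of bipartitions (one side of each branch, as a set of taxon names), that the transfer distance between two bipartitions is the number of taxa that must change sides for them to coincide, and that the support is normalized by p − 1, where p is the size of the smaller side of the branch. All function names, the toy taxa and the four replicate trees are hypothetical.

```python
from typing import FrozenSet, List

Split = FrozenSet[str]  # one side of a bipartition of the taxa


def transfer_distance(side_a: Split, side_b: Split, all_taxa: Split) -> int:
    """Number of taxa to move so that the two bipartitions become identical."""
    mismatches = len(side_a ^ side_b)
    # A bipartition is unordered, so side_b may also be matched by its complement.
    return min(mismatches, len(all_taxa) - mismatches)


def transfer_index(branch: Split, tree: List[Split], all_taxa: Split) -> int:
    """Smallest transfer distance between `branch` and any branch of `tree`."""
    p = min(len(branch), len(all_taxa) - len(branch))
    # Assumption: every tree also contains the pendant branch of each taxon,
    # and one of them is at distance p - 1, which bounds the index from above.
    best = p - 1
    for other in tree:
        best = min(best, transfer_distance(branch, other, all_taxa))
    return best


def felsenstein_support(branch: Split, replicates: List[List[Split]], all_taxa: Split) -> float:
    """Binary index: fraction of replicate trees containing exactly this branch."""
    hits = sum(1 for tree in replicates if transfer_index(branch, tree, all_taxa) == 0)
    return hits / len(replicates)


def transfer_support(branch: Split, replicates: List[List[Split]], all_taxa: Split) -> float:
    """Gradual index: one minus the mean transfer index, normalized by p - 1."""
    p = min(len(branch), len(all_taxa) - len(branch))
    if p < 2:
        raise ValueError("support is only meaningful for internal branches (p >= 2)")
    mean_index = sum(transfer_index(branch, t, all_taxa) for t in replicates) / len(replicates)
    return 1 - mean_index / (p - 1)


# Hypothetical toy data: eight taxa and four replicate trees.
TAXA = frozenset("ABCDEFGH")
BRANCH = frozenset("ABCD")
REPLICATES = [
    [frozenset(s) for s in ("AB", "CD", "ABCD", "EF", "GH")],   # branch present
    [frozenset(s) for s in ("AB", "ABC", "DE", "DEF", "GH")],   # one taxon misplaced
    [frozenset(s) for s in ("AB", "ABE", "ABEF", "CD", "GH")],  # two taxa misplaced
    [frozenset(s) for s in ("AB", "CD", "ABCD", "EF", "GH")],   # branch present
]

fbp = felsenstein_support(BRANCH, REPLICATES, TAXA)
tbe = transfer_support(BRANCH, REPLICATES, TAXA)
```

On this toy input the binary index gives a support of 0.5, because only two of the four replicates contain the branch exactly, whereas the transfer-based support is 0.75, because the two other replicates differ from the branch by only one or two misplaced taxa.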

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


References

  1. Efron, B. Bootstrap methods: another look at the jackknife. Ann. Stat. 7, 1–26 (1979).
  2. Efron, B. & Tibshirani, R. J. An Introduction to the Bootstrap (Chapman & Hall, New York, 1993).
  3. Felsenstein, J. Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39, 783–791 (1985).
  4. Van Noorden, R., Maher, B. & Nuzzo, R. The top 100 papers. Nature 514, 550–553 (2014).
  5. Sanderson, M. J. Objections to bootstrapping phylogenies: a critique. Syst. Biol. 44, 299–320 (1995).
  6. Holmes, S. Bootstrapping phylogenetic trees: theory and methods. Stat. Sci. 18, 241–255 (2003).
  7. Hillis, D. M. & Bull, J. J. An empirical test of bootstrapping as a method for assessing confidence in phylogenetic analysis. Syst. Biol. 42, 182–192 (1993).
  8. Felsenstein, J. & Kishino, H. Is there something wrong with the bootstrap on phylogenies? A reply to Hillis and Bull. Syst. Biol. 42, 193–200 (1993).
  9. Efron, B., Halloran, E. & Holmes, S. Bootstrap confidence levels for phylogenetic trees. Proc. Natl Acad. Sci. USA 93, 7085–7090 (1996).
  10. Susko, E. Bootstrap support is not first-order correct. Syst. Biol. 58, 211–223 (2009).
  11. Zharkikh, A. & Li, W.-H. Estimation of confidence in phylogeny: the complete-and-partial bootstrap technique. Mol. Phylogenet. Evol. 4, 44–63 (1995).
  12. Susko, E. First-order correct bootstrap support adjustments for splits that allow hypothesis testing when using maximum likelihood estimation. Mol. Biol. Evol. 27, 1621–1629 (2010).
  13. Soltis, D. E. & Soltis, P. S. Applying the bootstrap in phylogeny reconstruction. Stat. Sci. 18, 256–267 (2003).
  14. Huelsenbeck, J. & Rannala, B. Frequentist properties of Bayesian posterior probabilities of phylogenetic trees under simple and complex substitution models. Syst. Biol. 53, 904–913 (2004).
  15. Anisimova, M. & Gascuel, O. Approximate likelihood-ratio test for branches: a fast, accurate, and powerful alternative. Syst. Biol. 55, 539–552 (2006).
  16. Anisimova, M., Gil, M., Dufayard, J. F., Dessimoz, C. & Gascuel, O. Survey of branch support methods demonstrates accuracy, power, and robustness of fast likelihood-based approximation schemes. Syst. Biol. 60, 685–699 (2011).
  17. Stamatakis, A., Hoover, P. & Rougemont, J. A rapid bootstrap algorithm for the RAxML Web servers. Syst. Biol. 57, 758–771 (2008).
  18. Minh, B. Q., Nguyen, M. A. T. & von Haeseler, A. Ultrafast approximation for phylogenetic bootstrap. Mol. Biol. Evol. 30, 1188–1195 (2013).
  19. Hemelaar, J. The origin and diversity of the HIV-1 pandemic. Trends Mol. Med. 18, 182–192 (2012).
  20. Sanderson, M. J. & Shaffer, H. B. Troubleshooting molecular phylogenetic analyses. Annu. Rev. Ecol. Syst. 33, 49–72 (2002).
  21. Wilkinson, M. Majority-rule reduced consensus trees and their use in bootstrapping. Mol. Biol. Evol. 13, 437–444 (1996).
  22. Thorley, J. L. & Wilkinson, M. Testing the phylogenetic stability of early tetrapods. J. Theor. Biol. 200, 343–344 (1999).
  23. Thomson, R. C. & Shaffer, H. B. Sparse supermatrices for phylogenetic inference: taxonomy, alignment, rogue taxa, and the phylogeny of living turtles. Syst. Biol. 59, 42–58 (2010).
  24. Aberer, A. J., Krompass, D. & Stamatakis, A. Pruning rogue taxa improves phylogenetic accuracy: an efficient algorithm and webservice. Syst. Biol. 62, 162–166 (2013).
  25. Sanderson, M. J. Confidence limits on phylogenies: the bootstrap revisited. Cladistics 5, 113–129 (1989).
  26. Bréhélin, L., Gascuel, O. & Martin, O. Using repeated measurements to validate hierarchical gene clusters. Bioinformatics 24, 682–688 (2008).
  27. Charon, I., Denoeud, L., Guénoche, A. & Hudry, O. Maximum transfer distance between partitions. J. Classif. 23, 103–121 (2006).
  28. Day, W. H. E. The complexity of computing metric distances between partitions. Math. Soc. Sci. 1, 269–287 (1981).
  29. Lin, Y., Rajan, V. & Moret, B. M. E. A metric for phylogenetic trees based on matching. IEEE/ACM Trans. Comput. Biol. Bioinform. 9, 1014–1022 (2012).
  30. Künsch, H. R. The jackknife and the bootstrap for general stationary observations. Ann. Stat. 17, 1217–1241 (1989).
  31. Billera, L. J., Holmes, S. P. & Vogtmann, K. Geometry of the space of phylogenetic trees. Adv. Appl. Math. 27, 733–767 (2001).
  32. Kumar, S., Filipski, A. J., Battistuzzi, F. U., Kosakovsky Pond, S. L. & Tamura, K. Statistics and truth in phylogenomics. Mol. Biol. Evol. 29, 457–472 (2012).
  33. Truszkowski, J. & Goldman, N. Maximum likelihood phylogenetic inference is consistent on multiple sequence alignments, with or without gaps. Syst. Biol. 65, 328–333 (2016).
  34. Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree 2 – approximately maximum-likelihood trees for large alignments. PLoS ONE 5, e9490 (2010).
  35. Grenfell, B. T. et al. Unifying the epidemiological and evolutionary dynamics of pathogens. Science 303, 327–332 (2004).
  36. Schultz, A.-K. et al. jpHMM: improving the reliability of recombination prediction in HIV-1. Nucleic Acids Res. 37, W647–W651 (2009).
  37. Robinson, D. F. & Foulds, L. R. Comparison of phylogenetic trees. Math. Biosci. 53, 131–147 (1981).
  38. Semple, C. & Steel, M. A. Phylogenetics (Oxford Univ. Press, Oxford, 2003).
  39. Lefort, V., Longueville, J. E. & Gascuel, O. SMS: smart model selection in PhyML. Mol. Biol. Evol. 34, 2422–2424 (2017).
  40. Letunic, I. & Bork, P. Interactive tree of life (iTOL) v3: an online tool for the display and annotation of phylogenetic and other trees. Nucleic Acids Res. 44, W242–W245 (2016).
  41. Di Tommaso, P. et al. Nextflow enables reproducible computational workflows. Nat. Biotechnol. 35, 316–319 (2017).
  42. Sand, A. et al. tqDist: a library for computing the quartet and triplet distances between binary or general trees. Bioinformatics 30, 2079–2080 (2014).
  43. Katoh, K. & Standley, D. M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 30, 772–780 (2013).
  44. Delatorre, E. O. & Bello, G. Phylodynamics of HIV-1 subtype C epidemic in east Africa. PLoS ONE 7, e41904 (2012).
  45. Soares, M. A. et al. A specific subtype C of human immunodeficiency virus type 1 circulates in Brazil. AIDS 17, 11–21 (2003).
  46. Siddappa, N. B. et al. Identification of subtype C human immunodeficiency virus type 1 by subtype-specific PCR and its use in the characterization of viruses circulating in the southern parts of India. J. Clin. Microbiol. 42, 2742–2751 (2004).
  47. Guindon, S. et al. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst. Biol. 59, 307–321 (2010).
  48. Fletcher, W. & Yang, Z. INDELible: a flexible simulator of biological sequence evolution. Mol. Biol. Evol. 26, 1879–1888 (2009).


Acknowledgements

We thank F. Delsuc, S. Holmes, L. Chindelevitch and E. Susko for help and suggestions. This work was supported by the EU-H2020 Virogenesis project (grant number 634650, to E.W., T.D.O. and O.G.), by the INCEPTION project (PIA/ANR-16-CONV-0005, to F.L., D.C., M.D.F. and O.G.), by the Institut Français de Bioinformatique (IFB - ANR-11-INBS-0013, to D.C.), by the Flagship grant from the South African Medical Research Council (MRC-RFA-UFSP-01-2013/UKZN HIVEPI to E.W., T.D.O. and J.-B.D.E.) and by the H3ABioNet project (NIH grant number U41HG006941 to J.-B.D.E. and U24HG006941 to E.W. and T.D.O.).

Reviewer information

Nature thanks E. Susko and the other anonymous reviewer(s) for their contribution to the peer review of this work.

Author information


  1. Unité Bioinformatique Evolutive, C3BI USR 3756, Institut Pasteur & CNRS, Paris, France

    • F. Lemoine, D. Correia, M. Dávila Felipe & O. Gascuel
  2. Hub Bioinformatique et Biostatistique, C3BI USR 3756, Institut Pasteur & CNRS, Paris, France

    • F. Lemoine
  3. Department of Computer Science, University of the Western Cape, Cape Town, South Africa

    • J.-B. Domelevo Entfellner
  4. South African MRC Bioinformatics Unit, South African National Bioinformatics Institute, University of the Western Cape, Cape Town, South Africa

    • J.-B. Domelevo Entfellner
  5. KwaZulu-Natal Research Innovation and Sequencing Platform (KRISP), School of Laboratory Medicine and Medical Sciences, College of Health Sciences, University of KwaZulu-Natal, Durban, South Africa

    • E. Wilkinson & T. De Oliveira
  6. Centre for the AIDS Programme of Research in South Africa (CAPRISA), University of KwaZulu-Natal, Durban, South Africa

    • T. De Oliveira
  7. Méthodes et Algorithmes pour la Bioinformatique, LIRMM UMR 5506, Université de Montpellier & CNRS, Montpellier, France

    • O. Gascuel



Contributions

O.G. designed the research; F.L., J.-B.D.E., M.D.F. and O.G. performed the research; F.L. and J.-B.D.E. implemented the algorithms; F.L. and D.C. built the website and GitHub repositories; F.L. performed the analyses and graphics, with the help of E.W. and T.D.O. for HIV; O.G. wrote the paper with the help of all co-authors.

Competing interests

The authors declare no competing interests.

Corresponding author

Correspondence to O. Gascuel.

Extended data figures and tables

  1. Extended Data Fig. 1 Transfer index expectation and TBE support with random trees.

    a–d, For each number of taxa (16, 128, 1,024 and 8,191 in a, b, c and d, respectively) and random tree model, we compare the transfer index average over 100 runs with the upper-bound p − 1 (top graphs in each panel). We also compare the average transfer bootstrap support (TBE) to 0, and provide the maximum value observed among 100 runs (dashed lines), thus approximating the 1% quantile of the distribution (bottom graphs). In these experiments, the number of random ‘bootstrap’ trees is equal to 1,000. With l ≥ 1,024 (c), the average transfer index with random trees is very close in relative value to the upper-bound p − 1 and the approximation is already satisfying with l = 128 (b). Furthermore, the results are nearly the same for the four random tree models, suggesting that the asymptotic behaviour holds in a number of settings. As expected, the approximation of the transfer index over random bootstrap trees by p − 1 is better with small values of p. These results explain why moderate TBE supports (for example, 70% as used in this article) are sufficient to reject poor branches, as a TBE branch support of 70% cannot be observed by chance, even with a small number of taxa (for example, 16, as in a).

  2. Extended Data Fig. 2 Comparison of FBP and TBE using the mammal dataset and FastTree phylogeny.

    FBP and TBE supports are compared with respect to branch depth, quartet conflicts with the NCBI taxonomy and tree size (see main text and legends of Figs. 1, 2 for explanations). Three support cut-offs are used to select the branches: 50%, 70% and 90% (for example, 28 branches among the 1,446 in total have TBE ≥ 90% and 11 have FBP ≥ 90%). The FastTree topology is poor, with 38% of quartets contradicted by the NCBI taxonomy, and 404 of the 1,446 branches with contradictions above 20%. Despite this difficulty, FBP and TBE perform well: they give supports larger than 70% to a very low number of moderately ((5,20]%) and highly (> 20%) conflictual branches. FBP supports very few deep branches, whereas TBE supports a larger number of branches and is especially useful with large trees. Comparing the three cut-offs, we see that with a 50% cut-off the selected branches are still weakly contradicted, especially with FBP; as expected, with TBE the fraction of contradicted branches (> 5%) is a bit higher but still low (7%). With a cut-off of 90% very few branches are selected (2% with TBE), thus justifying the use of the 70% threshold for TBE, as is standard with FBP.

  3. Extended Data Fig. 3 Comparison of FBP and TBE using the mammal dataset and the phylogeny inferred by RAxML with rapid bootstrap.

    FBP and TBE supports are compared with respect to branch depth, quartet conflicts with the NCBI taxonomy and tree size (see main text and legends of Figs. 1, 2 for explanations). Three support cut-offs are used to select the branches: 50%, 70% and 90% (for example, 41 branches among the 1,446 in total have TBE ≥ 90% and 19 have FBP ≥ 90%). The RAxML topology is closer to the NCBI taxonomy than is the FastTree topology (27% versus 38% of contradicted quartets, and 353 versus 404 branches with contradiction > 20%, respectively). However, the RAxML topology is still relatively poor, as expected in this type of phylogenetic study based on a unique marker (Fig. 4 and main text). Despite this difficulty, FBP and TBE perform well as they give supports larger than 70% to a very low number of moderately ((5,20]%) and highly (> 20%) conflictual branches. The supports obtained with RAxML are higher than those obtained with FastTree (47 versus 29 branches with FBP > 70% for RAxML and FastTree, respectively; 158 versus 108 branches with TBE > 70% for RAxML and FastTree, respectively). Part of the explanation could be that the RAxML tree is more accurate than that of FastTree, and is thus better supported. Another factor is that the rapid bootstrap tends to be more supportive than the standard procedure, as shown in previous publications (ref. 16). Indeed, the rapid bootstrap uses already inferred trees to initiate tree searching, and therefore tends to produce less diverse bootstrap trees than the standard, slower procedure, which restarts tree searching from the very beginning for each replicate. Despite these differences between FastTree and RAxML with rapid bootstrap, similar conclusions are drawn when comparing FBP and TBE: FBP supports very few deep branches, whereas TBE supports a larger number of them; TBE is especially useful with large trees; and both methods support a very low number of contradicted branches. Comparing the support cut-offs, 70% again appears as a good compromise for both FBP and TBE.

  4. Extended Data Fig. 4 Comparison of FBP and TBE using the HIV dataset and FastTree phylogeny.

    FBP and TBE supports are compared with respect to branch depth, and tree size (see main text and legends of Figs. 1, 2 for explanations). Three support cut-offs are used to select the branches: 50%, 70% and 90% (for example, 1,624 branches among the 9,144 in total have TBE > 70% and 1,031 have FBP > 70%). Results are for the most part similar to those observed with the mammal dataset. We see a major effect of depth on FBP supports: with the full dataset, less than 1% of the deep (p > 16) branches have FBP support larger than 70%, whereas this percentage is higher than 20% with TBE. The effect of tree size is less pronounced. The fraction of supported branches decreases when the tree size increases from 35 to 571 taxa, but is analogous between 571 and 9,147 taxa. Furthermore, the gap between FBP and TBE remains similar, probably owing to the very large number of cherries and small clades, for which TBE and FBP are nearly equivalent. Regarding the support cut-off, 70% again appears as a good compromise for TBE, though there is no way to evaluate the fraction of supported branches that is actually erroneous. The interpretability of TBE will be a major asset for choosing the support level depending on the phylogenetic question being addressed. Here, as recombinant sequences are inevitable, lower supports than with mammals are likely to be acceptable.

  5. Extended Data Fig. 5 Subtype deep branching and comparison of FBP and TBE using medium-sized HIV datasets.

    As the taxa were randomly drawn from the full dataset, the supports and findings show some fluctuations. a, b, Trees obtained with two of the medium-sized datasets; branches with FBP > 70%: yellow dots; branches with TBE > 70%: blue dots; subtype clades: red stars, filled if support > 70% (see Methods and Fig. 1 legend for further details). c, Deep branching of the subtypes (ref. 19) and supports obtained on the full dataset (see also Fig. 1). Rare subtypes (H, J and K) are absent in the medium-sized datasets, and the subtype clades are almost perfectly recovered (only one incorrect taxon in the A clade for both trees). FBP supports are higher when using medium-sized datasets than when using the full dataset (for example, 58% and 99% for subtype B, versus 3% in Fig. 1). However, some subtype clades (for example, D) have moderate FBP support, though the clade matches the subtype perfectly. When using TBE, all subtype supports are higher than 95%. The deep branching is the same for all full and medium-sized datasets, and is identical to that found in a previous study (ref. 19), but is not supported by FBP, whereas TBE is larger than 70% for every branch (or path in Fig. 1). Again, the Indian and East African sub-epidemics of subtype C are supported by TBE, but not by FBP.

  6. Extended Data Fig. 6 Distribution of the instability score in HIV recombinants.

    We see a clear difference between the distributions of the instability score for the recombinant and non-recombinant sequences, which means that this score can be used to detect or confirm the recombinant status of sequences (box quantiles: 25%, 50% and 75%). See main text for details.

  7. Extended Data Fig. 7 Comparison of FBP and TBE using non-noisy and noisy simulated data.

    Noisy data include rogue taxa and homoplasy, whereas non-noisy data do not (see Methods for details). The graphs display the distribution of branches with FBP or TBE support > 70%. Supports are compared with respect to branch depth, tree size and quartet conflicts with the model tree used for simulations (see main text and legends of Figs. 1, 2 for explanations). Results are fully congruent with those obtained with real datasets. TBE supports more deep branches than FBP, especially with noisy data. The effect of tree size is also more visible with noisy MSAs, and the number of supported branches with moderate ((5,20]%) and high (> 20%) conflict levels is very low, for both FBP and TBE.

  8. Extended Data Fig. 8 Comparison of FBP and TBE at different support cut-offs using simulated, noisy data.

    Comparison of FBP and TBE with respect to branch depth, quartet conflicts and tree size, at different support cut-offs (see main text and legends of Figs. 1, 2 for explanations). A cut-off of 50% seems to be acceptable, as neither FBP nor TBE supports highly contradicted branches. However, this could be due to the low level of contradiction compared to real datasets (85 branches with contradiction > 20%, versus about 400 in the mammal dataset in Extended Data Figs. 2, 3).

  9. Extended Data Fig. 9 Distribution of the instability score in rogue taxa using simulated, noisy data.

    TBE again appears to be useful for detecting and confirming rogue taxa (box quantiles: 25%, 50% and 75%). See main text for details.

  10. Extended Data Fig. 10 Repeatability and accuracy of FBP and TBE using simulated data.

    The bootstrap theory (refs 1, 2) indicates that with large samples the supports estimated using bootstrap replicates should be close to supports obtained with datasets of the same size drawn from the same distribution as the original sample. We used simulated data to check that this property holds with protein MSAs of 1,449 taxa and about 500 sites (see main text for details). a, b, Comparison of the two supports (a, FBP; b, TBE) for all branches in the tree inferred by RAxML from the original MSA. We observe a clear correlation, which is higher for TBE (ρ = 0.85) than for FBP (ρ = 0.75) using Pearson’s linear correlation coefficient, but identical (0.83) using Spearman’s rank coefficient, which is better suited to the discontinuous nature of FBP. These results appear to contradict previous conclusions (ref. 7) that the bootstrap is a highly imprecise measure of repeatability. However, this previous work measured the probability of inferring the correct tree (not the supports of inferred branches, as is consistent with the bootstrap context) and its main result was based on 50 sites, which is probably too low for the bootstrap theory to apply. The bootstrap also relies on the plug-in principle (refs 2, 3, 6, 9), which states that the distribution of the distance between the true tree and the inferred tree can be well-approximated by the distribution of the distance between the inferred and bootstrap trees. c, The accuracy of TBE in predicting the topological distance between b and the true tree as measured using the normalized transfer index, for every branch b inferred by RAxML from the original MSA. Again, we observe a clear correlation (ρ = 0.74, Spearman’s rank coefficient = 0.70). We performed the same experiment with FBP, seeking to predict the presence or absence (1/0) of the inferred branch in the true tree; a lower but significant correlation was found (ρ = 0.59, Spearman’s rank coefficient = 0.54). d, Comparison using RAxML of the performance of simulation-based and bootstrap-based instability scores in detecting rogue taxa; both are nearly identical. TPR, true positive rate; FPR, false positive rate. e, Table summarizing the results described above and those of FastTree, which are nearly identical to those of RAxML, except regarding topological accuracy (%Correct, fraction of correct branches) for which RAxML is again more accurate than FastTree.
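
The legend of Extended Data Fig. 10 reports both Pearson’s linear coefficient and Spearman’s rank coefficient. The Python fragment below is a minimal sketch of how the two differ, assuming the textbook definitions (Spearman’s coefficient is Pearson’s coefficient computed on ranks, with tied values sharing their average rank). The function names and the input values are hypothetical and are not taken from the paper’s data.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's linear correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values all receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rank coefficient: Pearson's coefficient on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical values: a monotone but non-linear relationship.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 4.0, 9.0, 16.0, 25.0]
linear_r = pearson(x, y)
rank_r = spearman(x, y)
```

On this input the rank coefficient equals 1 because the ordering is perfectly preserved, whereas the linear coefficient is slightly below 1. A rank-based measure is insensitive to such non-linearity and handles the ties that arise when many supports sit at the same value, which is why it can suit a discontinuous index better.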

Supplementary information

  1. Supplementary Data

    This folder contains all multiple alignments, trees and (Nextflow) workflows

  2. Reporting Summary
