Felsenstein’s application of the bootstrap method to evolutionary trees is one of the most cited scientific papers of all time. The bootstrap method, which is based on resampling and replications, is used extensively to assess the robustness of phylogenetic inferences. However, increasing numbers of sequences are now available for a wide variety of species, and phylogenies based on hundreds or thousands of taxa are becoming routine. With phylogenies of this size Felsenstein’s bootstrap tends to yield very low supports, especially on deep branches. Here we propose a new version of the phylogenetic bootstrap in which the presence of inferred branches in replications is measured using a gradual ‘transfer’ distance rather than the binary presence or absence index used in Felsenstein’s original version. The resulting supports are higher and do not induce falsely supported branches. The application of our method to large mammal, HIV and simulated datasets reveals their phylogenetic signals, whereas Felsenstein’s bootstrap fails to do so.
We thank F. Delsuc, S. Holmes, L. Chindelevitch and E. Susko for help and suggestions. This work was supported by the EU-H2020 Virogenesis project (grant number 634650, to E.W., T.D.O. and O.G.), by the INCEPTION project (PIA/ANR-16-CONV-0005, to F.L., D.C., M.D.F. and O.G.), by the Institut Français de Bioinformatique (IFB - ANR-11-INBS-0013, to D.C.), by the Flagship grant from the South African Medical Research Council (MRC-RFA-UFSP-01-2013/UKZN HIVEPI to E.W., T.D.O. and J.-B.D.E.) and by the H3ABioNet project (NIH grant number U41HG006941 to J.-B.D.E. and U24HG006941 to E.W. and T.D.O.).
Reviewer information
Nature thanks E. Susko and the other anonymous reviewer(s) for their contribution to the peer review of this work.
Extended data figures and tables
a–d, For each number of taxa (16, 128, 1,024 and 8,191 in a, b, c and d, respectively) and random tree model, we compare the transfer index averaged over 100 runs with the upper bound p − 1 (top graphs in each panel). We also compare the average transfer bootstrap support (TBE) to 0, and provide the maximum value observed among 100 runs (dashed lines), thus approximating the 1% quantile of the distribution (bottom graphs). In these experiments, the number of random ‘bootstrap’ trees is equal to 1,000. With l ≥ 1,024 taxa (c), the average transfer index with random trees is very close in relative value to the upper bound p − 1, and the approximation is already satisfying with l = 128 (b). Furthermore, the results are nearly the same for the four random tree models, suggesting that the asymptotic behaviour holds in a number of settings. As expected, the approximation of the transfer index over random bootstrap trees by p − 1 is better with small values of p. These results explain why moderate TBE supports (for example, 70% as used in this article) are sufficient to reject poor branches, as a TBE branch support of 70% cannot be observed by chance, even with a small number of taxa (for example, 16, as in a).
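The role of the upper bound p − 1 can be illustrated with a minimal Python sketch. It assumes that each branch is encoded as the set of taxa on one of its sides, that p is the number of taxa on the lighter side of the branch, and that TBE is one minus the average transfer index divided by p − 1. The function names and the eight-taxon toy data are hypothetical and are not taken from the authors' software.

```python
from statistics import mean

def transfer_distance(side_a, side_b, n_taxa):
    """Number of taxa to move across one bipartition to obtain the other."""
    diff = len(side_a ^ side_b)          # taxa on which the two encodings disagree
    # A bipartition can be encoded by either of its sides, so also try the complement.
    return min(diff, n_taxa - diff)

def depth(branch, n_taxa):
    """p: number of taxa on the lighter side of the branch."""
    return min(len(branch), n_taxa - len(branch))

def transfer_index(branch, boot_tree, n_taxa):
    """Smallest transfer distance from `branch` to any branch of one bootstrap tree."""
    best = min(transfer_distance(branch, b, n_taxa) for b in boot_tree)
    # Leaf branches (omitted from the toy trees) guarantee the bound p - 1.
    return min(best, depth(branch, n_taxa) - 1)

def fbp(branch, boot_trees, n_taxa):
    """Felsenstein's support: fraction of replicates containing the exact branch."""
    return mean(1.0 if transfer_index(branch, t, n_taxa) == 0 else 0.0
                for t in boot_trees)

def tbe(branch, boot_trees, n_taxa):
    """Transfer support: 1 - (average transfer index) / (p - 1)."""
    avg = mean(transfer_index(branch, t, n_taxa) for t in boot_trees)
    return 1.0 - avg / (depth(branch, n_taxa) - 1)

# Hypothetical toy example: 8 taxa, one reference branch, three bootstrap trees
# (each tree is listed by its internal branches only).
n_taxa = 8
reference_branch = frozenset("ABCD")
boot_trees = [
    [frozenset("ABCD"), frozenset("EF")],   # exact match, transfer index 0
    [frozenset("ABC"), frozenset("AB")],    # one taxon away, transfer index 1
    [frozenset("ABE"), frozenset("CD")],    # two taxa away, transfer index 2
]
fbp_support = fbp(reference_branch, boot_trees, n_taxa)
tbe_support = tbe(reference_branch, boot_trees, n_taxa)
print(f"FBP = {fbp_support:.3f}, TBE = {tbe_support:.3f}")
```

In this toy case the branch is found exactly in only one replicate, so the binary index penalizes the two near misses as heavily as a complete absence, whereas the transfer-based support gives them partial credit.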
FBP and TBE supports are compared with respect to branch depth, quartet conflicts with the NCBI taxonomy and tree size (see main text and legends of Figs. 1, 2 for explanations). Three support cut-offs are used to select the branches: 50%, 70% and 90% (for example, 28 branches among the 1,446 in total have TBE ≥ 90% and 11 have FBP ≥ 90%). The FastTree topology is poor, with 38% of quartets contradicted by the NCBI taxonomy, and 404 of the 1,446 branches with contradictions above 20%. Despite this difficulty, FBP and TBE perform well: they give supports larger than 70% to a very low number of moderately ((5,20]%) and highly (> 20%) conflictual branches. FBP supports very few deep branches, whereas TBE supports a larger number of branches and is especially useful with large trees. Comparing the three cut-offs, we see that with a 50% cut-off the selected branches are still weakly contradicted, especially with FBP; as expected, with TBE the fraction of contradicted branches (> 5%) is a bit higher but still low (∼7%). With a cut-off of 90% very few branches are selected (∼2% with TBE), thus justifying the use of the 70% threshold for TBE, as is standard with FBP.
Extended Data Fig. 3 Comparison of FBP and TBE using the mammal dataset and the phylogeny inferred by RAxML with rapid bootstrap.
FBP and TBE supports are compared with respect to branch depth, quartet conflicts with the NCBI taxonomy and tree size (see main text and legends of Figs. 1, 2 for explanations). Three support cut-offs are used to select the branches: 50%, 70% and 90% (for example, 41 branches among the 1,446 in total have TBE ≥ 90% and 19 have FBP ≥ 90%). The RAxML topology is closer to the NCBI taxonomy than is the FastTree topology (27% versus 38% of contradicted quartets, and 353 versus 404 branches with contradiction > 20%, respectively). However, the RAxML topology is still relatively poor, as expected in this type of phylogenetic study based on a unique marker (Fig. 4 and main text). Despite this difficulty, FBP and TBE perform well as they give supports larger than 70% to a very low number of moderately ((5,20]%) and highly (> 20%) conflictual branches. The supports obtained with RAxML are higher than those obtained with FastTree (47 versus 29 branches with FBP > 70% for RAxML and FastTree, respectively; 158 versus 108 branches with TBE > 70% for RAxML and FastTree, respectively). Part of the explanation could be that the RAxML tree is more accurate than that of FastTree, and is thus better supported. Another factor is that the rapid bootstrap tends to be more supportive than the standard procedure, as shown in previous publications16. Indeed, the rapid bootstrap uses already inferred trees to initiate tree searching, and therefore tends to produce less diverse bootstrap trees than the standard, slower procedure, which restarts tree searching from the very beginning for each replicate. Despite these differences between FastTree and RAxML with rapid bootstrap, similar conclusions are drawn when comparing FBP and TBE: FBP supports very few deep branches, whereas TBE supports a larger number of them; TBE is especially useful with large trees; and both methods support a very low number of contradicted branches. 
Comparing the support cut-offs, 70% again appears as a good compromise for both FBP and TBE.
FBP and TBE supports are compared with respect to branch depth and tree size (see main text and legends of Figs. 1, 2 for explanations). Three support cut-offs are used to select the branches: 50%, 70% and 90% (for example, 1,624 branches among the 9,144 in total have TBE > 70% and 1,031 have FBP > 70%). Results are for the most part similar to those observed with the mammal dataset. We see a major effect of depth on FBP supports: with the full dataset, fewer than 1% of the deep (p > 16) branches have FBP support larger than 70%, whereas this percentage is higher than 20% with TBE. The effect of tree size is less pronounced. The fraction of supported branches decreases when the tree size increases from 35 to 571 taxa, but is similar for 571 and 9,147 taxa. Furthermore, the gap between FBP and TBE remains similar, probably owing to the very large number of cherries and small clades, for which TBE and FBP are nearly equivalent. Regarding the support cut-off, 70% again appears as a good compromise for TBE, though there is no way to evaluate the fraction of supported branches that is actually erroneous. The interpretability of TBE will be a major asset for choosing the support level depending on the phylogenetic question being addressed. Here, as recombinant sequences are inevitable, lower supports than with mammals are likely to be acceptable.
Extended Data Fig. 5 Subtype deep branching and comparison of FBP and TBE using medium-sized HIV datasets.
As the taxa were randomly drawn from the full dataset, the supports and findings show some fluctuations. a, b, Trees obtained with two of the medium-sized datasets; branches with FBP > 70%: yellow dots; branches with TBE > 70%: blue dots; subtype clades: red stars, filled if support > 70% (see Methods and Fig. 1 legend for further details). c, Deep branching of the subtypes19 and supports obtained on the full dataset (see also Fig. 1). Rare subtypes (H, J and K) are absent in the medium-sized datasets, and the subtype clades are almost perfectly recovered (only one incorrect taxon in A clade for both trees). FBP supports are higher when using medium-sized datasets than when using the full dataset (for example, 58% and 99% for subtype B, versus 3% in Fig. 1). However, some subtype clades (for example, D) have moderate FBP support, though the clade matches the subtype perfectly. When using TBE, all subtype supports are higher than 95%. The deep branching is the same for all full and medium-sized datasets, and is identical to that found in a previous study19, but is not supported by FBP, whereas TBE is larger than 70% for every branch (or path in Fig. 1). Again, the Indian and East African sub-epidemics of subtype C are supported by TBE, but not by FBP.
We see a clear difference between the distributions of the instability score for the recombinant and non-recombinant sequences, which means that this score can be used to detect or confirm the recombinant status of sequences (box quantiles: 25%, 50% and 75%). See main text for details.
Noisy data include rogue taxa and homoplasy, whereas non-noisy data do not (see Methods for details). The graphs display the distribution of branches with FBP or TBE support > 70%. Supports are compared with respect to branch depth, tree size and quartet conflicts with the model tree used for simulations (see main text and legends of Figs. 1, 2 for explanations). Results are fully congruent with those obtained with real datasets. TBE supports more deep branches than FBP, especially with noisy data. The effect of tree size is also more visible with noisy MSAs, and the number of supported branches with moderate ((5,20]%) and high (> 20%) conflict levels is very low, for both FBP and TBE.
Extended Data Fig. 8 Comparison of FBP and TBE at different support cut-offs using simulated, noisy data.
Comparison of FBP and TBE with respect to branch depth, quartet conflicts and tree size, at different support cut-offs (see main text and legends of Figs. 1, 2 for explanations). A cut-off of 50% seems to be acceptable, as neither FBP nor TBE supports highly contradicted branches. However, this could be due to the low level of contradiction compared to real datasets (85 branches with contradiction > 20%, versus about 400 in the mammal dataset in Extended Data Figs. 2, 3).
Extended Data Fig. 9 Distribution of the instability score in rogue taxa using simulated, noisy data.
TBE again appears to be useful for detecting and confirming rogue taxa (box quantiles: 25%, 50% and 75%). See main text for details.
The bootstrap theory1,2 indicates that with large samples the supports estimated using bootstrap replicates should be close to supports obtained with datasets of the same size drawn from the same distribution as the original sample. We used simulated data to check that this property holds with protein MSAs of 1,449 taxa and about 500 sites (see main text for details). a, b, Comparison of the two supports (a, FBP; b, TBE) for all branches in the tree inferred by RAxML from the original MSA. We observe a clear correlation, which is higher for TBE (ρ = 0.85) than for FBP (ρ = 0.75) using Pearson’s linear correlation coefficient, but identical (0.83) using Spearman’s rank coefficient, which is better suited to the discontinuous nature of FBP. These results appear to contradict previous conclusions7 that the bootstrap is a highly imprecise measure of repeatability. However, this previous work measured the probability of inferring the correct tree (not the supports of inferred branches, as is consistent with the bootstrap context) and its main result was based on 50 sites, which is probably too few for the bootstrap theory to apply. The bootstrap also relies on the plug-in principle2,3,6,9, which states that the distribution of the distance between the true tree and the inferred tree can be well approximated by the distribution of the distance between the inferred and bootstrap trees. c, The accuracy of TBE in predicting the topological distance between b and the true tree as measured using the normalized transfer index, for every branch b inferred by RAxML from the original MSA. Again, we observe a clear correlation (ρ = 0.74, Spearman’s rank coefficient = 0.70). We performed the same experiment with FBP, seeking to predict the presence or absence (1/0) of the inferred branch in the true tree; a lower but significant correlation was found (ρ = 0.59, Spearman’s rank coefficient = 0.54). 
d, Comparison using RAxML of the performance of simulation-based and bootstrap-based instability scores in detecting rogue taxa; both are nearly identical. TPR, true positive rate; FPR, false positive rate. e, Table summarizing the results described above and those of FastTree, which are nearly identical to those of RAxML, except regarding topological accuracy (%Correct, fraction of correct branches) for which RAxML is again more accurate than FastTree.
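The difference between the two correlation coefficients quoted in this legend can be sketched in plain Python. This is an illustration only: the support values below are invented, the function names are hypothetical, and Spearman's coefficient is computed as Pearson's coefficient applied to average ranks, which is the usual way of handling ties.

```python
from math import sqrt

def pearson(x, y):
    """Pearson's linear correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rank coefficient: Pearson's coefficient on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Invented supports for six branches (NOT data from the study): bootstrap-based
# values piled up at 0 and 1, versus supports from independently simulated MSAs.
bootstrap_support = [0.0, 0.0, 0.05, 0.9, 1.0, 1.0]
simulated_support = [0.02, 0.10, 0.0, 0.7, 1.0, 0.95]
r_linear = pearson(bootstrap_support, simulated_support)
r_rank = spearman(bootstrap_support, simulated_support)
print(f"Pearson = {r_linear:.2f}, Spearman = {r_rank:.2f}")
```

Because the rank coefficient depends only on the ordering of the branches, it is less sensitive to supports that accumulate at a few discrete values, which is the reason given above for preferring it with FBP.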