An investigation of irreproducibility in maximum likelihood phylogenetic inference

Phylogenetic trees are essential for studying biology, but their reproducibility under identical parameter settings remains unexplored. Here, we find that 3515 (18.11%) IQ-TREE-inferred and 1813 (9.34%) RAxML-NG-inferred maximum likelihood (ML) gene trees are topologically irreproducible when executing two replicates (Run1 and Run2) for each of 19,414 gene alignments in 15 animal, plant, and fungal phylogenomic datasets. Notably, coalescent-based ASTRAL species phylogenies inferred from Run1 and Run2 sets of individual gene trees are topologically irreproducible for 9/15 phylogenomic datasets, whereas concatenation-based phylogenies inferred twice from the same supermatrix are reproducible. Our simulations further show that irreproducible phylogenies are more likely to be incorrect than reproducible phylogenies. These results suggest that a considerable fraction of single-gene ML trees may be irreproducible. Increasing reproducibility in ML inference will benefit from providing analyses’ log files, which contain typically reported parameters (e.g., program, substitution model, number of tree searches) but also typically unreported ones (e.g., random starting seed number, number of threads, processor type).

line 241-244 I don't understand the explanation given here: "This is likely because different orders of addition of the per-site log likelihoods when using three or more threads in IQ-TREE could result in two different phylogenies with different log-likelihood values." Addition is commutative, so how can the result depend on the order?

It might be possible to start answering this question from the results at hand. For instance, looking at the supplementary materials, I could see that some of the differences in the ASTRAL trees are on nodes with high branch support, which I think is noteworthy and cause for concern. See also the discretionary point below.

Review by Barbara Holland
This paper is the first to my knowledge that looks at irreproducibility in phylogenetics that occurs even when the software, data and model applied are identical. It will be of keen interest to many users of phylogenetic methods. I expect that it will also prompt further software development by the research teams that support Raxml, IQTree and similar methods.
The key claim of the paper, that phylogenetic methods are frequently irreproducible when run under (almost) identical conditions has been convincingly demonstrated and clearly presented.

AUTHORS' RESPONSE:
We thank the reviewer for appreciating the importance of our work.
My main recommendation in a revision would be to somewhat broaden the scope of the discussion of the results. There are a couple of extra analyses that would be useful to do in order to support this, but I don't want to suggest anything that will take months of computing time (the amount of computational effort that has gone into the paper is already considerable).
The results regarding species tree methods like ASTRAL are quite startling. It would be good to broaden the discussion around this point and to put the results of this paper in the context of the wider debate in the phylogenetic community around the relative merits of concatenation versus ILS-aware summary methods. See e.g. Bryant, D. & Hahn, M. W. "The Concatenation Question", in Scornavacca, C., Delsuc, F. & Galtier, N. (eds), Phylogenetics in the Genomic Era, open access book (no commercial publisher), pp. 3.4:1-3.4:23, 2020, hal-02535651.

AUTHORS' RESPONSE: We thank the reviewer for the constructive suggestion. We added concatenation-based maximum likelihood analyses for each dataset and included an additional section entitled "Irreproducibility of coalescent- and concatenation-based phylogenomic inference" that presents these results, cites some of the suggested studies, and discusses this important issue in the context of the debate between concatenation and coalescence.
As shown in the table below (Table 1 in the revised manuscript), we found that:

a) 9 / 15 (60%) phylogenomic datasets produced topologically different coalescent-based ASTRAL species phylogenies when we inferred the Run1 and Run2 sets of individual gene trees with identical parameter settings in either IQ-TREE or RAxML-NG. When considering both topology and branch support values, we found that 11 / 15 (73.3%) phylogenomic datasets analyzed by IQ-TREE and 13 / 15 (86.7%) datasets analyzed by RAxML-NG yielded different coalescent-based ASTRAL species phylogenies.

b) 15 / 15 phylogenomic datasets produced topologically identical concatenation-based ML species phylogenies when we inferred them twice independently from the same supermatrix using identical parameter settings in either IQ-TREE or RAxML-NG. All phylogenomic datasets analyzed by RAxML-NG yielded exactly the same concatenation-based ML species phylogeny, with identical topology, branch lengths, and branch support across Run1 and Run2. In contrast, only 4 / 15 (26.7%) phylogenomic datasets analyzed by IQ-TREE yielded exactly the same concatenation-based ML species phylogeny with identical topology, branch lengths, and branch support across Run1 and Run2.

Concatenation-based ML trees were inferred twice from the supermatrix using IQ-TREE and RAxML-NG with identical settings. a The percentage of bipartitions (or internal branches) that differ between two inferred species trees, quantified with the normalized Robinson-Foulds tree distance. b The branch distance between two inferred species trees, computed as the branch score distance of Kuhner and Felsenstein with the R packages ape and phangorn. c The percentage of bipartitions (or internal branches) that received different bootstrap support values between two inferred species trees that were topologically identical (i.e., topological difference is 0).
When the topologies, branch lengths, or internal branch support values of two inferred phylogenies are identical, the corresponding estimated values are zero and are shown in bold. "NA" denotes "not applicable" due to either the lack of external branch lengths in the coalescent-based ASTRAL tree, different topologies between the two inferred species trees, or the lack of standard bootstrap support values in the RAxML-NG analyses of the two largest datasets (Plant: Green plants; Fungi: Budding yeasts), which were computationally too expensive.
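For readers unfamiliar with the branch score distance used in the table legend: it treats each tree as a vector of branch lengths indexed by bipartitions (a bipartition absent from a tree contributes length zero) and takes the Euclidean distance between the two vectors. A minimal sketch under that definition (an illustrative data structure, not the ape/phangorn implementation):

```python
import math

def kf_distance(bl1, bl2):
    """Kuhner-Felsenstein branch score distance.

    Each tree is given as a dict mapping a bipartition (any hashable
    label) to its branch length; a bipartition missing from one tree
    contributes its full length (squared) to the distance.
    """
    all_bips = set(bl1) | set(bl2)
    return math.sqrt(sum((bl1.get(b, 0.0) - bl2.get(b, 0.0)) ** 2
                         for b in all_bips))
```

Two trees with identical bipartitions and identical branch lengths give a distance of exactly 0, which is why zero entries in the table indicate exact reproducibility of branch lengths.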
To add in discussing this point it would be good to confirm that all 15 concatenated alignments were reproducible?
AUTHORS' RESPONSE: Our study's focus was on the reproducibility of the inferred phylogenies and not on the reproducibility of the sequence alignments, which were retrieved from previous studies. All analyses used identical multiple sequence alignments, so we can confirm that the observed irreproducibility does not stem from variation in the multiple sequence alignments used as input. However, the reviewer is correct that reproducibility of multiple sequence alignments is an additional potential variable that needs to be considered to achieve reproducible phylogenies. For example, a recent letter to the editor in Nature Ecology & Evolution (https://www.nature.com/articles/s41559-020-01296-w) stressed the importance of releasing uncurated sequence data. We have added a note on this issue to the last paragraph of the "Discussion" in the revised manuscript.
Also related to this issue, do edges that differ in run 1 and run 2 trees ever have high bootstrap support? Obviously it may be too computationally intensive to assess this for every irreproducible gene, but it might be interesting to pick a random sample of the genes that produced different trees for run 1 and run 2 and check the distribution of bootstrap support values for the bipartitions that differed.

AUTHORS' RESPONSE:
We thank the reviewer for the suggestion. To examine whether the bipartitions that differ between the Run1 tree and the Run2 tree were highly supported, we evaluated the support of each internal branch using 1,000 ultrafast bootstrap replicates (the option "-bb 1000") in the IQ-TREE analyses and 100 standard bootstrap replicates (the option "--bs-trees 100") in the RAxML-NG analyses. Since running all 19,414 gene alignments from 15 phylogenomic datasets was computationally challenging, we sampled the first 100 genes from each dataset. For each gene alignment, we inferred gene trees twice with identical settings (see Supplementary Text). As shown in the figure below (Supplementary Figure 9 in the revised manuscript), we found that:

a) The genes that yield topologically irreproducible phylogenies have lower bootstrap support values than genes that yield topologically reproducible phylogenies (panel a in the figure below).

b) Comparing the branch support of conflicting internal branches against that of congruent internal branches within irreproducible genes, we found that conflicting internal branches received lower bootstrap support values than congruent internal branches (panel b in the figure below).

c) Furthermore, examination of the frequency of strongly supported internal branches (ultrafast bootstrap value ≥ 95 in the IQ-TREE analysis and standard bootstrap value ≥ 70 in the RAxML-NG analysis) showed that the percentage of conflicting internal branches with strong bootstrap support is much lower than that of congruent internal branches (panel c in the figure below).

These new results are included in the second paragraph of the section entitled "Low phylogenetic informativeness, multithreading, and processor type contribute to irreproducibility" in the revised manuscript.
Is a broader "moral of the story" that using methods that are reliant on accurately inferred bifurcating genes trees is fundamentally flawed?
AUTHORS' RESPONSE: We thank the reviewer for raising this point. According to our results, we think that genes with low phylogenetic informativeness (e.g., low percentage of parsimony-informative sites in gene alignment or low average bootstrap support) are more likely to be irreproducible in their topology. Moreover, compared to reproducible bipartitions, irreproducible bipartitions tend to have lower bootstrap support. Using inferred bifurcating gene trees from Run1 and Run2 to infer coalescent-based ASTRAL species phylogenies led to irreproducibility in 9 / 15 datasets (Table 1).
To examine whether the irreproducibility of coalescent-based ASTRAL species phylogenies could be reduced by removing poorly supported bipartitions from the Run1 and Run2 trees of single genes, we collapsed branches with low bootstrap support values (≤50%) in the Run1 and Run2 gene trees, and then used these partially multifurcating gene trees to infer coalescent-based ASTRAL trees. As shown in the table below (Table 2 in the revised manuscript), we found that the percentage of conflicting bipartitions across coalescent-based ASTRAL species phylogenies from the Run1 and Run2 sets of individual gene trees ranges from 0.29% to 18.18%. Moreover, collapsing branches with low bootstrap support values eliminated the irreproducibility of coalescent-based ASTRAL species phylogenies for three phylogenomic datasets in IQ-TREE and four in RAxML-NG. However, eight phylogenomic datasets in IQ-TREE and seven in RAxML-NG still yielded topologically different ASTRAL species phylogenies.
These results suggest that the observed irreproducibility is sufficiently strong to alter the output trees even if we remove poorly supported branches from our gene trees. However, they also suggest that reducing the impact of poorly supported branches in gene trees can eliminate species tree irreproducibility in some datasets. We have included these new results in the second paragraph of the section "Irreproducibility of coalescent- and concatenation-based phylogenomic inference" in the revised manuscript.
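In split terms, the collapsing step described above simply removes from each gene tree every internal branch whose support falls at or below the threshold, leaving a partially multifurcating tree. A minimal sketch of that filtering step (hypothetical data structure, not the authors' pipeline):

```python
def collapse_low_support(branch_support, threshold=50):
    """Drop internal branches (bipartitions) with bootstrap support <= threshold.

    `branch_support` maps each nontrivial bipartition of a gene tree to its
    bootstrap support; the surviving bipartitions define the partially
    multifurcating tree that would be handed to a summary method such as ASTRAL.
    """
    return {bip: s for bip, s in branch_support.items() if s > threshold}
```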

Minor queries and typographical issues
Did you look for any differences in branch lengths between trees with the same topology? (panel b of fig 3 suggests that you did look and didn't find any differences). It is interesting that the irreproducibility issue never leads to finding different optima on a single tree.

AUTHORS' RESPONSE:
) and identical branch lengths (KF = 0) (green) (see panel a in the figure below). In addition, we found that differences in branch lengths between trees with the same topology were much smaller than those observed between trees that differed in their topologies (see panel b in the figure below).
These new results are included in the second paragraph of the section "The ML phylogenies of a considerable number of genes in phylogenomic datasets are irreproducible" in the revised manuscript.
If IQtree and Raxml both produce reproducible trees (i.e. their run 1 and run 2 are the same) do IQtree and Raxml always agree with each other?

AUTHORS' RESPONSE: We thank the reviewer for the question. To examine whether gene trees that are topologically reproducible by IQ-TREE are topologically identical to gene trees that are topologically reproducible by RAxML-NG, we compared IQ-TREE-inferred gene trees with RAxML-NG-inferred gene trees. We found that only 3,940 / 19,414 (20.3%) gene alignments yielded topologically identical phylogenies in Run1 and Run2 of IQ-TREE and in Run1 and Run2 of RAxML-NG (see figure below).
These new results are included in the third paragraph of the section "The ML phylogenies of a considerable number of genes in phylogenomic datasets are irreproducible" in the revised manuscript.
If the processor and number of threads used were identical did this completely remove the issue of irreproducibility, or are there other factors at play as well? This is suggested but not explicitly stated.

AUTHORS' RESPONSE:
Other factors are at play. Phylogenetic studies often report alignment, program, substitution model, and number of tree searches, but seldom report starting seed number, number of threads, and processor type. To better understand how these three typically unreported parameters (starting seed number, number of threads, and processor type) affect the irreproducibility of phylogenetic trees, we used three large representative studies to explicitly illustrate how each unreported parameter influences the reproducibility of inference.
As shown in the figure below (Figure 7 in the revised manuscript), we found that:

a) If the alignment, program, substitution model, and number of tree searches used to infer phylogenetic trees (which are typically reported) are unavailable, reproducibility is impossible.

b) As long as the alignment, program, substitution model, number of tree searches (all typically reported), and starting seed number (typically unreported) are available, the percentage of topologically reproducible genes increases to around 34% and 87% in the IQ-TREE and RAxML-NG analyses, respectively.

c) If all five parameters in (b), the number of threads, and the processor type used are all available, then all gene trees inferred are topologically reproducible in RAxML-NG (irrespective of the number of threads) and in IQ-TREE (for up to two threads). However, when using three or more threads in IQ-TREE, ~50% of genes remain irreproducible.
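The thread effect in (c) has a concrete numerical basis: floating-point addition is commutative but not associative, so accumulating the same per-site log-likelihoods in a different order (as happens when sites are divided across threads) can yield slightly different totals, which can in turn flip which of two near-tied topologies appears to have the higher likelihood. A minimal, program-independent illustration (the values are simulated, not real per-site log-likelihoods):

```python
import random

# Floating-point addition is commutative (a + b == b + a) but NOT
# associative: regrouping the same terms can change the rounded result.
assert (0.1 + 0.2) + 0.3 != 0.1 + (0.2 + 0.3)

# Simulated per-site log-likelihoods (illustrative values only).
random.seed(0)
site_lnl = [random.uniform(-12.0, -0.5) for _ in range(10_000)]

# Serial accumulation vs. a 3-way "per-thread partial sums" accumulation.
serial = sum(site_lnl)
threaded = sum(sum(site_lnl[i::3]) for i in range(3))

# The two totals agree only up to rounding error; such a tiny discrepancy
# can decide between two topologies whose likelihoods are nearly tied.
rounding_gap = abs(serial - threaded)
```

This is why fixing the seed alone does not guarantee reproducibility once the summation order depends on the thread count.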
We have revised the "Discussion" section, including Figure 7, to make this point clear.

line 92 what does it mean for two trees to be significantly different? I.e. what is the null hypothesis that is rejected by the AU test?

AUTHORS' RESPONSE: We thank the reviewer for the question. We used the AU test to test whether two topologically different trees inferred by Run1 and Run2 can equally explain the gene alignment (D); the null hypothesis H0 is that they can, and the alternative hypothesis (H1) is that they cannot (i.e., the Run1 and Run2 topologies are significantly different). If the p-value P(D | H0) is smaller than 0.05, we reject H0 in favor of H1. We have revised this sentence to clarify this.
Line 108 what is normalised RF distance normalised by? Is a normalised RF of 0.58 saying that 58% of the splits differ? line 370 what is a kernel vector?

AUTHORS' RESPONSE:
We thank the reviewer for the question. We changed "kernel vector" to "kernel instruction".
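On the normalisation query above (an editorial sketch, not the authors' code): the Robinson-Foulds distance is conventionally normalised by its maximum possible value, 2(n - 3) for two unrooted, fully binary trees on n taxa, so a normalised RF of 0.58 does mean that 58% of the nontrivial bipartitions differ between the two trees. A toy implementation on trees written as nested tuples:

```python
from itertools import chain

def leaves(t):
    """Leaf-name set of a tree written as nested tuples of strings."""
    if isinstance(t, str):
        return frozenset([t])
    return frozenset(chain.from_iterable(leaves(child) for child in t))

def bipartitions(tree):
    """Nontrivial bipartitions (internal branches) of an unrooted tree."""
    taxa = leaves(tree)

    def clades(t):
        if isinstance(t, str):
            return
        for child in t:
            yield leaves(child)
            yield from clades(child)

    return {frozenset({c, taxa - c})           # rooting-independent form
            for c in clades(tree)
            if 1 < len(c) < len(taxa) - 1}     # drop trivial splits

def normalized_rf(t1, t2):
    """RF distance divided by its maximum, 2(n - 3), for binary trees."""
    b1, b2 = bipartitions(t1), bipartitions(t2)
    return len(b1 ^ b2) / (2 * (len(leaves(t1)) - 3))

t1 = ('A', ('B', 'C'), (('D', 'E'), 'F'))
t2 = ('A', ('B', 'C'), (('D', 'F'), 'E'))
```

Here t1 and t2 share two of their three internal branches, so normalized_rf(t1, t2) is 2/6 ≈ 0.33; a value of 0.58 on real trees likewise means that 58% of the internal branches are in conflict.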

Reviewer #2 (Remarks to the Author):
The submission addresses the issue of reproducibility of phylogenetic tree inference. The authors consider several large published eukaryotic datasets. On each set of related genes, they infer two trees using the same program and parameters and check to which extent the topology between them agree. They observe differences in a substantial number of cases, and conclude that phylogenetic inference suffers from widespread irreproducibility, and conclude that authors should disclose random seed number, number of treads, and processor type used.
The work addresses an important problem, and the results obtained are thought-provoking.
I have one main concern, described below in detail, which I think could be satisfactorily addressed in a revision of the manuscript with more discussion and a restatement of the main conclusion.
Finally, note that I have reviewed an earlier version of the manuscript. I thank the authors for addressing some of the minor points I raised in my previous report. The points included in this report are, as far as I can tell, yet to be addressed.

AUTHORS' RESPONSE:
We thank the reviewer for their comments on two different versions of this study, for appreciating the importance of our work, and for the constructive feedback.

Phylogenetic methods are very broadly used, but the issue of interpreting trees, and in particular gauging the uncertainty around the inferred topology, is challenging even for experts (see e.g. the intense debates around the placement of Ctenophora). Phylogenomic resources are complex pipelines, of which each step is afflicted by random and systematic errors.
The issue of reproducibility is thus highly relevant. But what exactly should be reproducible? In this paper, the authors have taken the view that reproducibility means to obtain the same tree topology every time. But, as I just mentioned, trees are subject to inference uncertainty, which is typically gauged using branch support measures such as bootstrap support. If the difference between two topologies is in uncertain parts of the tree (i.e. within the confidence interval, so to speak), is this still a reproducibility issue?
AUTHORS' RESPONSE: We fully agree with the reviewer's view that irreproducibility in phylogenetic inference can be examined not just for the tree's topology, but also for other aspects, such as the tree's branch lengths and branch support values (e.g., bootstrap support values).
Given this broader view of the reproducibility of phylogenetic inference, the reviewer asks whether the differences in topology are localized in parts of the tree that are weakly supported. To examine whether the bipartitions that differ between the Run1 and Run2 trees were highly supported, we evaluated the reliability of each internal branch using 1,000 ultrafast bootstrap replicates (the option "-bb 1000") in IQ-TREE and 100 standard bootstrap replicates (the option "--bs-trees 100") in RAxML-NG. Since examining all 19,414 gene alignments from 15 phylogenomic datasets was computationally intractable, we sampled the first 100 genes from each dataset. For each gene alignment, we inferred gene trees twice with identical settings. As shown in the figure below (Supplementary Figure 9 in the revised manuscript), we found that:

a) The genes that yield topologically irreproducible phylogenies have lower bootstrap support values than genes that yield topologically reproducible phylogenies (see panel a in the figure below).

b) Comparing the branch support of conflicting internal branches against that of congruent internal branches within irreproducible genes (i.e., genes that yield two topologically different trees in Run1 and Run2), we found that conflicting internal branches received lower bootstrap support values than congruent internal branches (see panel b in the figure below).

c) Furthermore, we found that the percentage of conflicting internal branches with strong bootstrap support (ultrafast bootstrap value ≥ 95 in the IQ-TREE analysis, standard bootstrap value ≥ 70 in the RAxML-NG analysis) is much lower than that of congruent internal branches (see panel c in the figure below).
These new results were included in the second paragraph of the section "Low phylogenetic informativeness, multithreading, and processor type contribute to irreproducibility" in the revised manuscript.
Given that these results suggest that irreproducibility disproportionately affects parts of the topology that are weakly supported, the reviewer asks: "If the difference between two topologies is in uncertain parts of the tree (i.e. within the confidence interval, so to speak), is this still a reproducibility issue?" We believe these results do indeed raise a reproducibility issue for the following reason. If we use the Run1 and Run2 sets of individual gene trees (generated using identical parameter settings in either IQ-TREE or RAxML-NG), we find that 9 / 15 (60%) phylogenomic datasets produced topologically different coalescent-based ASTRAL species phylogenies (see the new Table 1 in our revised manuscript). Notably, several of these datasets produced topologically different coalescent-based ASTRAL species phylogenies even when branches below 50% bootstrap support were collapsed in the Run1 and Run2 gene trees (see the new Table 2 in our revised manuscript). Thus, the observed irreproducibility is sufficiently strong to alter the results of state-of-the-art phylogenomic approaches, even when sources of uncertainty are removed.
Because of the vast number of possible tree topologies (for a set of just 20 taxa, there are 221,643,095,476,699,771,875 possible unrooted binary topologies) and the complexity of optimising over the often rugged likelihood "landscape" of the topological space, it can be very difficult to find the global optimum. Inferred trees should never be considered at face value, but rather interpreted in light of the inference uncertainty.
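As an aside, the figure the reviewer quotes can be verified directly: the number of unrooted binary topologies on n labelled taxa is the double factorial (2n - 5)!! = 3 × 5 × ... × (2n - 5), which grows super-exponentially:

```python
def num_unrooted_topologies(n):
    """Count unrooted binary tree topologies on n labelled taxa: (2n - 5)!!"""
    count = 1
    for k in range(3, 2 * n - 4, 2):   # odd factors 3, 5, ..., 2n - 5
        count *= k
    return count

print(num_unrooted_topologies(20))  # 221643095476699771875, as quoted
```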

AUTHORS' RESPONSE:
We agree that there is uncertainty in inference that stems from the fact that we are performing heuristic searches. However, we think that our results on irreproducibility reveal uncertainty in inference that is well over and above that stemming from the fact that we are conducting heuristic searches.

Thus, unless the purpose is to debug code, I see relatively limited value in enforcing the ability to reproduce specific rounding-off errors or optimisation paths, as long as these differences don't materially affect the output trees *after accounting for inference uncertainty*.

AUTHORS' RESPONSE:
We thank the reviewer for the comment. To investigate the effect of irreproducibility of single gene trees on output trees, we used the Run1 and Run2 sets of individual gene trees (generated using identical parameter settings in either IQ-TREE or RAxML-NG) to infer coalescent-based ASTRAL species phylogenies. We find that 9 / 15 (60%) phylogenomic datasets produced topologically different coalescent-based ASTRAL species phylogenies. Thus, the observed irreproducibility is sufficiently strong to alter the output trees (see Table 1 in the revised manuscript).
We next collapsed branches with low bootstrap support values (≤50%) in the Run1 and Run2 gene trees, and then used these partially multifurcating gene trees to infer coalescent-based ASTRAL trees. As shown in the table below (Table 2 in the revised manuscript), we found that the percentage of conflicting bipartitions between coalescent-based ASTRAL species phylogenies from the Run1 and Run2 sets of individual gene trees ranges from 0.29% to 18.18%, even though we accounted for the inference uncertainty through collapsing branches with low bootstrap support values (≤50%). Although collapsing branches with low bootstrap support values eliminated the irreproducibility of coalescent-based ASTRAL trees for three phylogenomic datasets in IQ-TREE and four in RAxML-NG, eight phylogenomic datasets in IQ-TREE and seven in RAxML-NG still yielded topologically different ASTRAL species phylogenies. Thus, the observed irreproducibility is sufficiently strong to alter the output trees even if we remove poorly supported branches from our gene trees. a Percentage of conflicting bipartitions between coalescent-based ASTRAL species phylogenies inferred using Run1 and Run2 gene trees. b Percentage of highly conflicting bipartitions (LPP ≥ 90%) between coalescent-based ASTRAL species phylogenies inferred using Run1 and Run2 gene trees.
We have included these new results in the second paragraph of the section "Irreproducibility of coalescent- and concatenation-based phylogenomic inference" in the revised manuscript.
There is, however, a real cost in demanding such narrow-sense reproducibility. As the authors themselves observed, avoiding variation due to CPU-specific optimisations mean that programs will typically run more slowly. More importantly, it would impose strong constraints on computing on clusters, as this would require keeping track of the number of jobs, order of computations, and CPU types employed for each subtask.

AUTHORS' RESPONSE:
We agree with the reviewer that the cost is substantial. Given that phylogenomic analyses are computationally expensive (e.g., analysis of each of the 15 phylogenomic datasets took an average of ~3,400 CPU hours), it would be counterproductive to restrict phylogenomic inference to specific nodes / processors. Instead, we believe that authors should provide the log file of each analysis, which contains the values of key parameters relevant to reproducibility (e.g., type of processor, number of threads, and random seed), so that others can evaluate any differences in the results of analyses of the same dataset. This is straightforward to do and requires relatively little effort and cost. Of course, whether our suggestion is adopted by individual researchers and the phylogenetics community is for them to decide. However, we believe the levels of irreproducibility observed in our analyses justify raising this suggestion in the last sentence of the abstract and in the last paragraph of the "Discussion".
So while I think that it is interesting to demonstrate that multiple runs of the same data, using the same parameters, can result in different point estimates of the topology (which the authors convincingly show), I feel the conclusion should not be that all authors now need to systematically report their choice of seed and CPU types, and deactivate multithreading, but rather that more work is needed to assess whether these differences are significant in light of the uncertainty, and whether they might affect conclusions drawn from the data.

AUTHORS' RESPONSE:
We thank the reviewer for the suggestion. We have already described in our previous responses that irreproducibility can yield different output trees in phylogenomic analyses. To further examine whether the observed differences are significant, we used the AU test to assess whether two topologically different trees inferred in two replicates (Run1 and Run2) can equally explain (the null hypothesis H0) the gene alignment (D). If the p-value P(D | H0) is smaller than or equal to 0.05, we reject the null hypothesis H0, suggesting that the Run1 and Run2 gene trees are significantly different. As shown in the figure below (Figure 4 in the revised manuscript), we found that the Run1 and Run2 topologies for 302 / 3,515 (8.59%) irreproducible single-gene ML phylogenies generated by IQ-TREE and 457 / 1,813 (25.2%) irreproducible phylogenies generated by RAxML-NG were significantly different (the approximately unbiased (AU) test; P-value ≤ 0.05). We believe that these results, together with the results of the new Table 1 (which shows that this irreproducibility can affect coalescent-based species tree inference), suggest that these differences can affect the conclusions drawn from the data. Clearly more work is necessary (and we have now stated this in the penultimate paragraph of the "Discussion" section), but we believe that our recommendation to provide the log file of each analysis is both reasonable (given our results) and involves little effort and cost.