Much progress has been achieved in disentangling evolutionary relationships among species in the tree of life, but some taxonomic groups remain difficult to resolve despite increasing availability of genome-scale data sets. Here we present a practical approach to studying ancient divergences in the face of high levels of conflict, based on explicit gene genealogy interrogation (GGI). We show its efficacy in resolving the controversial relationships within the largest freshwater fish radiation (Otophysi) based on newly generated DNA sequences for 1,051 loci from 225 species. Initial results using a suite of standard methodologies revealed conflicting phylogenetic signal, which supports ten alternative evolutionary histories among early otophysan lineages. By contrast, GGI revealed that the vast majority of gene genealogies supports a single tree topology grounded on morphology that was not obtained by previous molecular studies. We also reanalysed published data sets for exemplary groups with recalcitrant resolution to assess the power of this approach. GGI supports the notion that ctenophores are the earliest-branching animal lineage, and adds insight into relationships within clades of yeasts, birds and mammals. GGI opens up a promising avenue to account for incompatible signals in large data sets and to discern between estimation error and actual biological conflict explaining gene tree discordance.
The advent of genomic approaches is delivering unprecedented amounts of sequence data from non-model organisms, sparking enthusiasm and heightening expectations about the resolution of ancient divergences in the tree of life. Substantial controversy persists, however, concerning the best way to analyse genome-wide data sets, especially for taxonomic groups shown to be recalcitrant to phylogenetic resolution.
In the past decade, the field of molecular phylogenetics has shifted from concatenation methods to an increasingly diverse collection of multi-species coalescent approaches that account for incomplete lineage sorting (ILS). It is theoretically sound to use methods that model coalescent variance, particularly those that integrate over gene tree uncertainty in a Bayesian framework. Yet full parametric co-estimation of gene trees and species trees is not currently scalable to large, genome-wide data sets, which are instead analysed by reconciling a collection of pre-estimated individual gene trees under the coalescent. A major assumption of these ‘summary’ coalescent methods is that individual gene trees accurately depict the genealogical history of fragments of the genome that independently segregate (coalescent genes, or c-genes). To meet this theoretical challenge with empirical data sets, practitioners of phylogenetics have been trapped between two undesirable extremes. On one end, the analysis of short, recombination-free genes (consisting of a few hundred sites) is error-prone due to limited signal-to-noise content.
Here, we present a phylogenomic approach that efficiently extracts the genealogical signal from short c-genes by reducing the complexity of tree space on the basis of topological constraints. This method is similar to others that place priors on gene tree topologies, but is unique in that priors are set to test specific hypotheses directly. We show how this procedure resolves longstanding controversies using newly generated data for otophysan fishes and published data sets for other exemplary groups (metazoans, neoavian birds, eutherian mammals and yeasts). Otophysan fishes constitute the dominant group in freshwater habitats around the world, having experienced one of the nine major radiations among jawed vertebrates. The clade, comparable in diversity to birds, consists of more than 10,000 species arrayed into 77 families, 7 suborders, and 4 orders (Cypriniformes, Characiformes, Siluriformes and Gymnotiformes). Otophysans include a well-studied model species (the zebrafish, Danio rerio), carps, minnows, characins (for example, tetras and piranhas), knifefishes (such as the electric eel) and catfishes. For the past three decades, the most widely accepted hypothesis of relationships among otophysan orders has been based on an exemplary morphological analysis (hereafter referred to as H 0). Molecular studies, in contrast, have produced conflicting phylogenetic results that differ from the null morphological tree.
To address this challenging phylogenetic question, we collected genome-wide sequence data from 1,051 exons using target capture and Illumina sequencing for 225 species representing all major otophysan lineages (Supplementary Table 2). Exons targeted for this study were chosen from genome comparisons to select single-copy short sequences (with an average length of 200 bp), while avoiding long stretches of DNA to minimize recombination. Analyses of complete data sets, smaller subsets and individual gene fragments using a range of standard approaches designed to minimize conditions that may lead to systematic error failed to provide compelling support for a single phylogenetic hypothesis, suggesting that choice of method (concatenation or species trees), data subset (for example, strong signal or conserved genes) or data type (DNA or protein sequences) strongly influences the outcome (Fig. 1). In this case, far from settling the dispute, best-practice methodologies aimed at minimizing systematic error in phylogenomics seemed to exacerbate it: neither concatenation nor species tree methods, nor DNA-based or protein-based analyses, converge on a single topology. To gain additional insight, we developed an analytical approach based on topology tests that gauges the strength of phylogenetic signal contained in each gene alignment in favour of alternative hypotheses. By constraining gene-tree space to a small number of relevant options (15 in this case; Fig. 1), this approach overcomes gene tree estimation error to reveal overwhelming evidence favouring H 0. To further assess the utility and performance of this approach, we re-examined published data sets for other groups with controversial phylogenetic relationships.
Genealogical signal of exon markers at different scales
Before inspecting incongruence among concatenation and species tree methods in regard to the central hypothesis being investigated (the interrelationships of otophysan lineages), we assessed the collective performance of exon markers in multi-locus analyses and the extent of estimation error for individual gene trees by: (i) evaluating support for uncontroversial groups (otophysan orders, suborders and families) that are independent of the central hypothesis (Supplementary Fig. 1); (ii) comparing tree space dispersion plots using multidimensional scaling (MDS) based on unweighted Robinson–Foulds distances 32 (Supplementary Fig. 2A); and (iii) estimating average support values across all clades in the corresponding trees (Supplementary Fig. 2B). The first test is a proxy for phylogenetic accuracy (the probability of resolving undisputed groups), whereas the latter two measure phylogenetic precision (the deviation of estimates in tree space and robustness of inferences).
Individual gene trees were estimated using standard partitioned maximum likelihood (ML) and Bayesian methods, whereas multi-locus analyses explored a large number of alternative approaches either involving concatenation or species tree methods, applied to multiple data sets (complete data or subsets filtered by properties) and data types (DNA and protein sequences; Supplementary Table 2) to account for potential systematic error due to base compositional biases 33,34 . For multi-locus methods, resolution of expected taxonomic groups of otophysans is almost unanimously obtained with high confidence (Supplementary Fig. 1). The resulting multi-locus trees are well supported (with an average support of 79.1%; Supplementary Fig. 2B) and appear tightly clustered in tree space (Supplementary Fig. 2A), suggesting high phylogenetic precision. These results indicate that, collectively, our exon markers contain strong phylogenetic signal at different evolutionary scales, and seem resilient to specific assumptions underlying each method.
By contrast, individual gene trees perform poorly both in terms of accuracy and precision, almost always failing to resolve undisputed groups (Supplementary Fig. 1), displaying topological distances in tree space that are orders of magnitude greater than those of multi-locus phylogenies (Supplementary Fig. 2A), and resulting in poorly supported clades (with an average support of 24.8%; Supplementary Fig. 2B). This result is not unexpected given that the average length of exons in our data set is 200 bp or 67 amino acids. Although short c-genes have the benefit of minimizing the risk of recombination, these results indicate that gene tree error is extensive.
Incongruence between concatenation and species trees
Despite the ability of multi-locus methods to resolve undisputed clades, the branching order of major otophysan lineages receives equivocal support (Fig. 1). We designed and implemented 45 different analyses of multi-locus data based on commonly applied concatenation and coalescent methods (Supplementary Table 2), implementing several criteria to minimize systematic error, and obtained support for 10 out of 15 possible topologies (Fig. 1). The distribution of results is decidedly uneven, with most concatenation methods supporting topology H a01 and most species tree methods supporting topology H a02. Variants of both approaches also support other topologies, and H 0 ranks second or third in frequency (seven analyses support both H 0 and H a02). No individual gene tree resolves any of these alternatives, confirming a high degree of estimation error based on single loci. These results suggest that in-depth exploration of phylogenomic data sets using alternative methods reflecting widely accepted best-practice criteria will reveal high levels of incongruence that is not easily integrated with current methodology to unambiguously support a single phylogeny 6,24 . Even more worrisome is the observation that conflicting topologies often receive strong bootstrap support, especially those resulting from concatenation analyses (Figs 1, f2, 3). We suggest that averaged support values from trees inferred from alternative analyses and data subsets (Supplementary Fig. 3) may provide a more realistic way to reflect nodal support and confidence in phylogenomics, while also accounting for incongruence inherent to data set type or method.
A method to overcome gene tree error
Instead of using error-prone gene trees as input for coalescent analyses, we devised ‘gene genealogy interrogation’ (GGI), an approach based on topology tests to identify the genealogical history, among a set of predefined alternatives, that each gene supports with highest probability. To establish the ranking of alternative trees and their probabilities, GGI implements constrained ML searches to optimize site likelihood scores for each gene alignment under each hypothesis. The method is based on the approximately unbiased (AU) topology test 35 , which uses multi-scale bootstrapping techniques and can be applied to simultaneous comparisons of multiple trees. GGI is designed to address one phylogenetic problem at a time by defining a set of alternative hypotheses. If gene tree error is suspected to be a major source of conflict in other parts of the tree, then new GGI tests must be conducted.
We applied GGI to test the central hypothesis of otophysan relationships and examined all possible unrooted topologies for five lineages (Fig. 1), conducting a total of 31,530 constrained ML searches (15 topologies for each of 1,051 gene trees, based on protein or DNA sequences). Here, each alternative hypothesis is defined by a different set of phylogenetic ‘backbone’ relationships between major lineages. In each optimization, we constrained each of the five major subclades to be monophyletic (see below and the Supplementary Information), but we imposed no other constraint with regard to relationships within each subclade, nor with respect to branch lengths or model parameters. More than twice as many topology tests found that hypothesis H 0 was supported with the highest probability, for both DNA (495 loci) and protein (314 loci) data sets, compared to the second-best hypothesis (H a10) with 174 and 146 tests in favour, respectively. This difference increased to about 5-fold (325 versus 69 for DNA and 197 versus 39 for protein) for test results where the best hypothesis (H 0) is significantly better (P < 0.05) than the second-ranked hypothesis (H a10). All other alternative topologies received negligible support (Fig. 2a,b and Supplementary Table 3). Interestingly, both DNA- (Fig. 2a) and protein-based (Fig. 2b) GGI analyses produced similar results, suggesting that non-stationarity at the DNA level is not a significant systematic bias compromising the topology tests.
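The tallying step described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' pipeline: the input layout (a dictionary of AU-test P values per locus and hypothesis), the function names and the toy values are all assumptions made for the example. In practice the P values would come from a topology-test program run on the constrained ML trees. A rank 1 hypothesis is counted as significant here when the second-ranked hypothesis has P < 0.05, following the wording of the text.

```python
from collections import Counter

N_TOPOLOGIES = 15   # unrooted backbone trees for five lineages
N_LOCI = 1051       # gene alignments
N_DATA_TYPES = 2    # DNA and protein


def total_constrained_searches(n_topologies=N_TOPOLOGIES,
                               n_loci=N_LOCI,
                               n_data_types=N_DATA_TYPES):
    """Number of constrained ML searches: one per topology, locus and data type."""
    return n_topologies * n_loci * n_data_types


def tally_rank1(au_pvalues, alpha=0.05):
    """Count, per hypothesis, how many loci rank it first.

    au_pvalues: {locus: {hypothesis: AU P value}} (hypothetical input layout).
    Returns two Counters: all rank 1 hypotheses, and the rank 1 hypotheses
    whose runner-up is rejected at the given alpha.
    """
    all_rank1 = Counter()
    significant_rank1 = Counter()
    for pvals in au_pvalues.values():
        ranked = sorted(pvals.items(), key=lambda item: item[1], reverse=True)
        best_hypothesis = ranked[0][0]
        all_rank1[best_hypothesis] += 1
        # Assumption: "significantly better" means the second-ranked
        # hypothesis has an AU P value below alpha.
        if len(ranked) > 1 and ranked[1][1] < alpha:
            significant_rank1[best_hypothesis] += 1
    return all_rank1, significant_rank1


# Toy P values, invented for illustration only.
toy_pvalues = {
    "locus_a": {"H0": 0.91, "Ha10": 0.03, "Ha01": 0.01},
    "locus_b": {"H0": 0.62, "Ha10": 0.41, "Ha01": 0.02},
    "locus_c": {"Ha10": 0.88, "H0": 0.04, "Ha01": 0.01},
}

if __name__ == "__main__":
    print(total_constrained_searches())
    print(tally_rank1(toy_pvalues))
```

With the toy input, two loci rank H0 first but only one of them does so significantly, which mirrors the distinction between the "all rank 1" and "significant rank 1" counts reported in the text.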
We acknowledge that while monophyly of the five major otophysan groups (subclades) is supported with high confidence by multiple lines of evidence at the species-tree level (morphology, mitochondrial DNA, multi-locus nuclear DNA and genomics
ILS is not the main problem
Coalescent theory predicts that phylogenetic histories of lineages evolving under a combination of short internal branches and large effective population sizes are prone to high incidence of ILS 8 . It has been demonstrated that for five or more lineages such conditions can generate gene trees with topologies that differ from the underlying species phylogeny with highest probability 36,37 . When the evolutionary history of a clade falls within this so-called anomaly zone 8 , simply adopting the most frequent gene tree as a surrogate for the species phylogeny (the democratic vote procedure) is positively misleading.
To account for this possibility, as the genuine backbone tree of inter-ordinal relationships must be 1 of 15 possibilities (enumeration of all possibilities for an unrooted tree of five taxa), we used the GGI trees selected by the topology tests (the preferred constrained gene trees optimized by ML) as input for summary coalescent analyses. For this test we employed both DNA- and protein-based trees in combination with two different species-tree methods. We also applied two alternative approaches for sampling GGI trees, one using all rank 1 trees (complete data with 1,051 genes) and another using only the set of rank 1 trees that are significantly better than the alternatives (P < 0.05; a subset of 397 DNA trees and 275 protein trees; Supplementary Table 3). Of the eight species-tree analyses conducted, all converged on the H 0 tree, with each backbone node receiving 100% bootstrap support. Finally, an adapted version of the GGI-based coalescent method that uses constrained topologies in combination with unconstrained gene trees also supports the H 0 tree (Supplementary Information).
Our results suggest that the evolutionary history of major otophysan lineages is not trapped in the anomaly zone. In fact, these analyses identify only a minor proportion of gene trees that are significantly discordant with the inferred species phylogeny (17.7–28.4%, most supporting H a10), suggesting that other sources of error rather than ILS are likely the main cause of incongruence. Gene tree estimation error may be biasing summary coalescent approaches, but the causes for discrepancy between coalescent and concatenation results are unclear. For two hypotheses (H 0 and H a03), some concatenation and species tree methods converge, but more often they seem to produce non-overlapping sets of results (Fig. 1). We were unable to isolate any single factor as the principal explanation for discordance in multi-locus analysis. Possibilities include the combination of slight model misspecifications interacting in analyses of large data sets and amplifying systematic biases, or processes such as horizontal gene transfer or duplication/extinction affecting some of the sampled genes 38 . What is perhaps most surprising is the observation that the most common topology from concatenation is incongruent with our GGI tree, even in the absence of evidence for substantial ILS. An investigation of factors that could account for this pattern would be a fruitful subject of future theoretical and analytical studies. In summary, the coalescent analyses using GGI trees resolve with high confidence the branching order of major otophysan groups (Supplementary Fig. 3), a result that is fully congruent with the morphological hypothesis (H 0) 27 , thereby reconciling a long history of molecular and morphological conflict.
Addressing other recalcitrant clades with GGI
To test the generality of the GGI approach, we conducted additional tests using published phylogenomic data sets for distantly related groups with controversial resolution in the tree of life (Supplementary Table 4). We chose four emblematic phylogenetic questions that have recently received substantial attention: (i) the position of sponges and ctenophores (comb jellies) at the base of the animal (metazoan) tree
Two contrasting patterns emerge from these tests (Fig. 2c–f and Supplementary Table 3). First, as in the case of otophysans, the metazoan data set provides strong differential support in favour of a single topology. While the traditional view has been that sponges are the first branching lineage in the animal tree, most recent phylogenomic studies support the so-called Ctenophora-sister hypothesis that places comb jellies as the sister group to all other animals (refs.
For metazoans, yeast and mammals, we test hypotheses involving only four lineages, implying that only three possible topologies need to be considered. Because rooted three-taxon (or unrooted four-taxon) species trees are free from anomalies under the coalescent 8,36,37 , the most frequent gene tree topology in these cases may be interpreted as the species phylogeny (assuming subclade monophyly in individual gene trees is undisrupted by deep coalescences; see Supplementary Information). A topology supporting the clade Saccharomyces castellii + Saccharomyces sensu stricto is more frequently favoured (426 genes) than the two other alternatives (332 and 312 respectively) based on the yeast data set (Fig. 2f). This result is consistent with the gene tree frequencies originally reported 4 . For the mammalian data set (Fig. 2e), the GGI results prefer one of two competing hypotheses, albeit by a small difference: 155 genes place the tree shrew (Scandentia) as sister to primates, as claimed by the original study, whereas 165 genes place it as sister to a clade including Rodentia plus Lagomorpha, in agreement with another reanalysis of this data set 3,20 . The placement of the mousebird among major neoavian lineages is a six-taxon problem that entails tests for 105 possible topologies (an analysis beyond the scope of this study). Our preliminary GGI analyses did, however, provide a test among eight competing hypotheses 43,44 , favouring with statistical significance the position of the mousebird as sister to other Afroaves 44 . For this case, high levels of gene tree discordance have been attributed to pervasive ILS during the early diversification of Neoaves 44 , requiring the set of 105 topology tests for GGI-based coalescent analyses.
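The topology counts quoted above (3, 15 and 105) follow from the standard formula for the number of unrooted binary trees on n labelled lineages, (2n − 5)!!. The sketch below computes that count and applies a simple plurality rule to the yeast gene counts given in the text. The function names and the labels of the two alternative yeast topologies are placeholders introduced for this example.

```python
def n_unrooted_topologies(n_lineages):
    """Number of unrooted binary topologies for n labelled lineages: (2n - 5)!!."""
    if n_lineages < 3:
        raise ValueError("at least three lineages are required")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for odd in range(3, 2 * n_lineages - 4, 2):
        count *= odd
    return count


def plurality(votes):
    """Return the most frequent topology and its share of all gene trees."""
    winner = max(votes, key=votes.get)
    return winner, votes[winner] / sum(votes.values())


# Gene counts from the yeast data set as reported in the text; the two
# alternative labels are placeholders.
yeast_votes = {
    "castellii_plus_sensu_stricto": 426,
    "alternative_1": 332,
    "alternative_2": 312,
}

if __name__ == "__main__":
    for n in (4, 5, 6):
        print(n, n_unrooted_topologies(n))
    print(plurality(yeast_votes))
```

The plurality rule is only interpretable as the species tree for the four-lineage cases, where no anomalous gene trees are expected. For five or more lineages the text instead feeds the selected gene trees into summary coalescent analyses.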
Our GGI method provides a promising avenue to address difficult phylogenetic problems by accounting for gene tree estimation error through topology tests. The method interrogates individual gene partitions by constraining tree space to evaluate the relative support for specific hypotheses. This principle has been applied by other methods but without a priori references 25 , or using Bayesian applications that do not scale up to genome-level data sets 10,12 . Thus, GGI has the favourable property of avoiding potential pitfalls inherent to concatenation and many other species tree approaches.
For our otophysan data set, GGI resolves a longstanding question in fish systematics and provides unambiguous support for the null morphological tree
. This reconciliation has remained elusive in most previous molecular studies
Confidence in the selection of a preferred hypothesis provided by the AU test mitigates sampling error in tree estimation arising from limited signal in small gene partitions (that is, data sets composed of short gene fragments that are otherwise free of recombination), and avoids systematic biases with additive effects in large data sets. For cases where the main hypothesis can be defined in terms of an unrooted four-taxon statement (such as metazoans, mammals and yeasts), our GGI approach is expected to meet the statistical consistency of gene-tree ‘democratic vote’, even if severely affected by ILS. For problems involving five or more lineages (for example, otophysans and birds), we propose and apply a pipeline whereby we first estimate a set of plausible gene trees under our alternative hypotheses, rank them for each gene, and then use the highest ranked gene trees (under different criteria) as input for summary species-tree analysis (GGI-based species tree). For cases in which deep coalescences may result in the violation of the assumption of subclade monophyly imposed by the topological constraints (thereby making the assignment to specific n-taxon statements difficult), we apply a modified version of the GGI-based coalescent procedure that uses a mixture of constrained and unconstrained gene trees (Supplementary Information). Tree distributions obtained with GGI, combined with the coalescent analyses, may prove useful for a broad class of data sets as a practical option to resolve stubbornly ambiguous clades in the tree of life.
In conclusion, the effect of sampling error in gene tree estimation is often overlooked when implementing summary coalescent approaches to resolve ancient divergences and/or recalcitrant clades in the Tree of Life using genome-wide data 3,15,16 . Our study shows that gene genealogy interrogation is a useful tool to distinguish between estimation error and actual biological conflict in explaining gene tree discordance, ultimately improving phylogenetic reconstructions of complex events such as the early diversification of otophysan fishes. We acknowledge that correct interpretation of the signal of gene tree discordance requires holistic models accounting for all biological processes that affect phylogenetic reconstruction (such as ILS, paralogy and reticulation) 38,47 . Until such models become available and efficient enough to synthesize large numbers of gene trees, GGI is a promising way forward because it provides explicit tests for gene tree incongruence around hard-to-resolve nodes, increasing our ability to infer organismal phylogeny.
A flowchart of the experimental design and methodological approaches used is shown in Supplementary Fig. 4. Details of the pilot study are explained in the Supplementary Information. Databases are archived in Zenodo (http://dx.doi.org/10.5281/zenodo.51603). We first conducted a pilot experiment to sequence 3,957 orthologous exons using target enrichment (TE) and Illumina sequencing (Supplementary Table 5). We selected exons by screening the zebrafish and medaka genomes for single-copy, slowly evolving genes. Probes designed using zebrafish sequences were hybridized with the genomic DNA of 14 species encompassing the diversity of ray-finned fishes. We then chose a subset of single-copy exons exhibiting the highest capture efficiency among otophysans, and designed a new probe set based on sequences from four otophysans and five outgroups obtained in our experiment. We used these markers to collect 1,051 protein-coding sequences for 225 species representing 53 (of 77) families (279,012 DNA or 92,901 protein sites). We estimated DNA- and protein-based gene trees using partitioned maximum likelihood (ML) and Bayesian approaches. To investigate incongruence and to identify the set of possible evolutionary histories of major otophysan lineages, we conducted a total of 45 different multi-locus analyses. These comprised concatenation (23 analyses) or coalescent-based methods (22 analyses), using either DNA (25 analyses) or protein (20 analyses) data sets, and including complete data (13 analyses) or subsets of ~200 genes filtered following recommended criteria (32 analyses). Properties for subset selection include slowly evolving genes, strong phylogenetic signal, AT-richness, stationarity, and data completeness. Given their uncontroversial placement as first branching clade in Otophysi
Genomic data collection (Otophysi)
A total of 1,041 target loci were selected from the pilot study (Supplementary Methods), and 21 markers extensively used by previous molecular studies were added to the marker set 56,57 , including the mitochondrial COI gene for quality control (Supplementary Database 4). A new probe library was designed to capture the set of 1,041 slowly evolving exons based on sequences from nine species examined in the pilot study (Pellona, Chanos, Kneria, Tanichthys, Danio, Apteronotus, Brustarius, Astyanax and Oryzias). Probes for the remaining 21 markers were designed on the basis of sequences obtained from GenBank for 55 species representing major otophysan lineages. A total of 20,000 RNA baits (2× tiling) were synthesized by MYcroarray for the 1,062 marker set (Supplementary Database 5).
Tissue samples were collected from species that included 110 representative characiforms (12 from suborder Citharinoidei in 2 families, and 98 from suborder Characoidei in 21 families), 79 siluriforms (23 families) and 13 gymnotiforms (5 families). Because monophyly of Cypriniformes and its placement as the earliest branching otophysan lineage is uncontroversial, we only included 23 cypriniform species (4 families), and all were used as outgroups (Supplementary Table 1). In total, 10 samples yielded poor DNA quality; 6 others had to be excluded due to cross-contamination (detected by comparing COI sequences). The final taxonomic sampling consisted of 225 species representing 53 of the 77 valid families of Otophysi. Most samples sequenced include voucher specimens deposited in various museum collections (Supplementary Table 5).
Data collection and processing
For each sample, genomic DNA was extracted from fin or muscle tissue using a phenol-chloroform protocol in the Autogen platform. Library preparation, TE and Illumina sequencing (single-end) were outsourced to Rapid Genomics and followed the same protocols used for the pilot experiment. FASTQ files were trimmed using Geneious Pro v8.1 (http://www.geneious.com) with an error probability cutoff of 0.01. Contigs were assembled by mapping sequences against the zebrafish reference using the ‘medium sensitivity’ algorithm in Geneious with five iterations. The resulting contigs that assembled with <10× coverage and that were shorter than 75 bp were removed. Two loci with substantial amounts of missing data and nine loci producing more than one contig for at least one species after assembly were also excluded.
In summary, three consecutive steps were implemented to filter out putative cases of paralogy. (i) Zebrafish and medaka genomes were screened in silico using reciprocal BLAST searches in EvolMarker 58 to select single-copy genes for the initial marker set (see pilot study in Supplementary Information). Although single-copy genes are defined on the basis of similarity thresholds, genes that share this property among distantly related genomes are probably orthologous. Gene duplications that take place in particular lineages may lead to the presence of in-paralogues that will not necessarily confound phylogenetic analysis of ancient divergences, but judicious exclusion of these may be warranted. (ii) Of the initial 3,957 loci used in the pilot study, 279 that produced two or more contig assemblies for at least one species were removed. (iii) Nine loci that resulted in two or more contigs for at least one species in the otophysan data set were removed.
The final marker set consisted of 1,051 protein-coding genes (279,012 sites). The sequences obtained were aligned using MAFFT on a locus-by-locus basis. All alignments were visually inspected and edited to check for open reading frames. Seventy-one exon alignments had ambiguously aligned internal blocks that were removed to improve positional homology and to enable translation (Supplementary Databases 6 and 7). Alignments were translated to proteins using TranslatorX 59 . The final set of 1,051 exons was annotated using gene ontology (GO) in Blast2GO v3.2 (http://www.blast2go.org/), with an E-Value-Hit-Filter of 10⁻⁶, an annotation cut-off of 55, and a GO weight of 5 (Supplementary Database 8).
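As a consistency check, the locus and taxon counts reported in this section can be reconciled with simple bookkeeping. The short sketch below uses only numbers stated in the text. It assumes that the eleven excluded loci (two with substantial missing data and nine with multiple contigs) were drawn from the 1,062-marker set. The variable names are illustrative.

```python
# Marker bookkeeping, using counts stated in the text.
slowly_evolving_exons = 1041      # target loci selected from the pilot study
previously_used_markers = 21      # markers added from earlier molecular studies
probe_marker_set = slowly_evolving_exons + previously_used_markers

excluded_missing_data = 2         # loci with substantial missing data
excluded_multiple_contigs = 9     # loci with more than one contig in a species
# Assumption: both exclusions apply to the probe marker set.
final_marker_set = (probe_marker_set
                    - excluded_missing_data
                    - excluded_multiple_contigs)

# Taxon bookkeeping for the final sampling, by order.
species_by_order = {
    "Characiformes": 110,
    "Siluriformes": 79,
    "Gymnotiformes": 13,
    "Cypriniformes": 23,
}
total_species = sum(species_by_order.values())

if __name__ == "__main__":
    print(probe_marker_set, final_marker_set, total_species)
```

Under that assumption the counts agree with the 1,062-marker probe set, the final set of 1,051 loci and the 225 species reported.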
Phylogenetic analysis and alternative data matrices
Forty-five different multi-locus analyses were conducted. These comprised concatenation (23 analyses) or coalescent-based species tree methods (22 analyses) that used either DNA (25 analyses) or protein sequence data (20 analyses), for the complete data set (13 analyses) or for data subsets of ~200 genes (32 analyses; Supplementary Table 2). Subsets of markers were selected on the basis of the following criteria recommended by previous studies:
Gene trees with highest average bootstrap support (analyses 05, 17, 27 and 39). Following Salichos and Rokas 4 , this subset includes 200 loci that resulted in gene trees harbouring the highest average bootstrap support (BS) values across all internodes (estimated with RAxML). Average BS values were estimated with the phylogenetic package Ape 62 using R 63 . The average BS values were 65% and 35% for the DNA and protein subsets, respectively. See details under ‘Phylogenetic inference’ (below).
Gene tree congruence (analyses 06, 18, 28 and 40). To reduce gene tree estimation error, a subset of 210 gene trees with the lowest average pairwise Robinson–Foulds (RF) distance was selected following recommendations and using scripts provided by Simmons et al. 3 . Selected gene trees based on DNA and protein data sets had average RF distances of 0.70–0.81 and 0.90–0.96, respectively. Outlier gene trees were discarded by taking into account the number of shared terminals in pairwise comparisons.
Slowly evolving genes (analyses 07, 19, 29 and 41). The most conserved locus set was selected for phylogenetic analysis 2 . The 200 alignments with highest average identity (88–95% and 96–100% for DNA and protein alignments, respectively) were selected using Geneious.
Exons with longer sequences (analyses 08, 20, 30 and 42). This subset includes 205 locus alignments whose sequence length is greater than 350 nucleotides (60,225 sites) or 96 amino acids (28,907 sites). The underlying criterion is that longer exons harbour better signal-to-noise ratios that would minimize gene tree error.
Minimizing missing data (analyses 09, 21, 31 and 43). All single-locus alignments that had at least 200 species (out of 225) were included to minimize empty cells per taxon in the corresponding gene matrices, thus reducing the proportion of missing data 60 . A total of 231 loci were selected for both DNA (68,682 sites) and protein (22,441 sites) sequence sets.
Genes shared with other studies (analyses 10, 11, 22, 23, 32, 33, 44 and 45). Two subsets were assembled following recent studies that used exon-based phylogenomics in fishes and applied different criteria for marker selection (Li et al. 49 ; Inoue et al. 61 ). A total of 243 loci (60,147 sites) in common with Li et al. 49 and 175 loci (44,559 sites) in common with Inoue et al. 61 were selected.
Genes with minimal base compositional bias (analyses 12 and 34). This criterion seeks to minimize potentially misleading effects of base composition heterogeneity. We showed that mean disparity index (DI) estimated from all pairwise comparisons for each gene alignment provides a useful metric to rapidly assess the degree of compositional heterogeneity in multiple gene partitions 33 . A total of 200 loci (46,677 sites) with the lowest mean DI (0.0096–0.078) were selected using MEGA5 64 .
Highest AT content (analyses 13 and 35). Romiguier et al. 34 showed that GC-rich genes result in higher levels of gene-tree error and incongruence relative to AT-rich loci. Percentage AT for each locus alignment was estimated using Geneious and a set of 200 loci (52,809 sites) with the highest AT content (49–60%) was selected.
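Ranking loci by AT content can be sketched in a few lines. This is not the Geneious calculation used in the study; it assumes that AT content is the share of A and T among unambiguous nucleotides in an alignment, ignoring gaps and ambiguity codes, and the function names and toy sequences are hypothetical.

```python
def at_content(sequences):
    """Fraction of A and T among unambiguous nucleotides in an alignment."""
    at = sum(seq.upper().count("A") + seq.upper().count("T") for seq in sequences)
    gc = sum(seq.upper().count("G") + seq.upper().count("C") for seq in sequences)
    if at + gc == 0:
        return 0.0
    return at / (at + gc)

def richest_in_at(alignments, keep):
    """Return the loci with the highest AT content, plus all scores."""
    scores = {locus: at_content(seqs) for locus, seqs in alignments.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:keep], scores

# Hypothetical toy alignments.
toy_alignments = {
    "locus1": ["ATAT", "ATGC"],
    "locus2": ["GGCC", "GCAT"],
    "locus3": ["AT-A", "NTTA"],
}
at_selected, at_scores = richest_in_at(toy_alignments, keep=2)
```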
Assessment of tree inference accuracy and precision
We conducted three different analyses to assess the collective performance of exon markers in multi-locus analyses and the extent of estimation error in individual gene trees. First, we gauged the power of the data to resolve and support taxonomic groups (otophysan orders, suborders, and families) that are undisputed in the literature and recognized on the basis of ample morphological and molecular evidence. These groups are independent of the central hypothesis tested. The presence of expected clades in multi-locus analyses and individual gene trees was assessed using the R package MonoPhy 65 . Twelve families represented by only one individual in this study were not tested (Supplementary Fig. 1). Second, we analysed discordance among 1,051 gene trees by graphically representing their dispersion in tree space in comparison with the 45 multi-locus trees. This test used multidimensional scaling (MDS) based on unweighted Robinson–Foulds distances 32 as implemented in the TreeSetViz module in Mesquite 66 . The MDS analyses were conducted separately for DNA- and protein-based trees. Third, we estimated average support values across all nodes in the corresponding trees using the R package ape.
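The first of these checks, the recovery of expected clades, can be illustrated with a short sketch. It does not reproduce MonoPhy; it assumes rooted trees written as nested tuples and treats a group as monophyletic when its members form exactly one clade. Groups with fewer than two sampled members are reported as untestable, mirroring the exclusion of singleton families. All names are hypothetical.

```python
def clades(tree):
    """Return every clade (as a frozenset of leaf names) and the full taxon set."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(walk(child) for child in node))
        found.add(below)
        return below

    taxa = walk(tree)
    return found, taxa

def is_monophyletic(tree, group):
    """True/False if the group can be tested in this tree, otherwise None."""
    found, taxa = clades(tree)
    members = frozenset(group) & taxa  # ignore members missing from this tree
    if len(members) < 2:
        return None
    return members in found

# Hypothetical toy tree with two catfishes, two characins and one outgroup.
toy_tree = ((("catfish1", "catfish2"), ("characin1", "characin2")), "carp")
catfishes_ok = is_monophyletic(toy_tree, ["catfish1", "catfish2"])
mixed_ok = is_monophyletic(toy_tree, ["catfish1", "characin1"])
singleton = is_monophyletic(toy_tree, ["carp"])
```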
All alignments were concatenated into a single super-matrix for phylogenetic analysis based on the complete data set (1,051 loci) or on subsets described above (Supplementary Database 10). For all data sets, partitioned RAxML analyses (by gene and by codon position) were replicated 30 times and the best-scoring tree across searches was selected. DNA analyses used the GTRGAMMA model and protein analyses the PROTGAMMAWAG model in RAxML. Branch support was assessed using the rapid bootstrap algorithm with 300 replicates under the previous models; the collection of bootstrapped trees was used to draw bipartition frequencies onto the optimal tree. Additional unpartitioned analyses for the complete data sets (1,051 loci) were conducted using FastTree-2 67 under the GTR (DNA) and WAG (protein) models; FastTree local support values were estimated with the Shimodaira–Hasegawa test 35 .
Bayesian analyses were run using ExaBayes v1.4.1 68 under the GTRGAMMA (DNA) and PROTGAMMAWAG (protein) models, with branch lengths linked across partitions. Two independent MCMC runs were conducted from random starting topologies, sampling every 500 generations. ExaBayes runs continued until the termination condition was met (mean topological differences below 5%), with at least 500,000 generations. Posterior distributions of trees were summarized using the ‘consense’ function with default burn-in. Convergence was assumed when all parameters had effective sample sizes (ESS) greater than 200, as estimated with Tracer v1.5 69 . In addition to model-based inference approaches, parsimony searches were performed for the complete nucleotide alignment in TNT v1.0 70 . The runs used the ‘new technology’ search option, with sectorial, ratchet and tree-fusing methodologies, with default parameters. To assess branch support, 100 bootstrap searches were performed via TBR branch swapping (summarized in a consensus tree).
Gene tree inference
Individual gene trees were inferred using RAxML and ExaBayes, as explained above. To compare the performance of these two methods in gene tree estimation, we computed Robinson–Foulds (RF) distances between each gene tree and a reference topology (estimated with the complete concatenated data set). RAxML produced gene trees with smaller dispersion in tree topology relative to ExaBayes (smaller average RF distances); therefore, RAxML gene trees were used for downstream analyses.
Summary coalescent species-tree inference
Species-tree analyses were conducted for all data sets using ASTRAL-2 (Supplementary Database 10). This method uses unrooted gene trees as input and maximizes the number of quartet trees shared between the gene trees and the species tree. ASTRAL-2 has been shown to outperform other summary methods under different levels of incomplete lineage sorting. To account for gene tree estimation uncertainty and to assess clade support, we used 100 RAxML bootstrapped gene trees for each gene (as described above) as input for ASTRAL-2. Additional summary coalescent analyses were performed using STAR and NJ-ST 71,72 (as implemented in the STRAW server 73 ; complete data sets only). All STAR and NJ-ST analyses were rooted using Danio rerio; all other cypriniform taxa were excluded. Details on assessment of tree inference accuracy and precision are given in the Supplementary Methods.
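The quartet criterion can be made concrete with a small sketch that scores one candidate species tree against a set of gene trees. It does not perform the search that ASTRAL-2 carries out over candidate trees; it only counts shared quartets, and it assumes fully resolved trees on identical taxon sets written as nested tuples. Function names and toy trees are hypothetical.

```python
from itertools import combinations

def clade_list(tree):
    """Return all clades (frozensets of leaf names) and the full taxon set."""
    found = []

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(walk(child) for child in node))
        found.append(below)
        return below

    taxa = walk(tree)
    return found, taxa

def quartet_topology(clades, quartet):
    """Return the two pairs that the tree induces on four taxa."""
    q = frozenset(quartet)
    for clade in clades:
        inside = q & clade
        # An edge with two quartet members on each side fixes the topology.
        if len(inside) == 2:
            return frozenset([inside, q - inside])
    return None  # only reached for unresolved trees

def quartet_score(gene_trees, species_tree):
    """Count gene-tree quartets that match the candidate species tree."""
    sp_clades, taxa = clade_list(species_tree)
    score = 0
    for gene_tree in gene_trees:
        g_clades, _ = clade_list(gene_tree)
        for quartet in combinations(sorted(taxa), 4):
            if quartet_topology(g_clades, quartet) == quartet_topology(sp_clades, quartet):
                score += 1
    return score

# Hypothetical toy trees on five taxa (five quartets per gene tree).
candidate = ((("A", "B"), "C"), ("D", "E"))
toy_gene_trees = [
    ((("A", "B"), "C"), ("D", "E")),
    ((("A", "C"), "B"), ("D", "E")),
]
shared = quartet_score(toy_gene_trees, candidate)
```

Here the first gene tree shares all five quartets with the candidate and the second shares three, giving a score of 8 out of a possible 10.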
Gene genealogy interrogation (GGI)
The GGI tests implemented require three major steps. First, we define a set of hypotheses to test: for our study, this includes the 15 possible topologies (Fig. 1) for the major lineages of otophysans (undisputed monophyletic groups: Cypriniformes, Gymnotiformes, Siluriformes, Characoidei, and Citharinoidei). Topological constraints enforcing these 15 hypotheses were defined to obtain, for each of the 1,051 loci, an ML gene tree consistent with each hypothesis. Site likelihood scores for each tree were obtained with RAxML. Second, a topology test was conducted for each gene by statistically comparing the site likelihood scores of all 15 trees via the approximately unbiased (AU) test 35 as implemented in CONSEL v0.1 55 . The AU test uses multi-scale bootstrapping techniques and can be applied to simultaneous comparisons of multiple trees to estimate a P-value for each topology. Finally, trees were ranked according to the P-values, and the number of gene trees supporting each hypothesis with highest probability was visualized using R plots. A tutorial for conducting all GGI steps using custom code is provided in the SI Text (Supplementary Databases 11 and 12).
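The final ranking step can be sketched as a simple tally. This is not the custom code distributed with the tutorial; it assumes that the AU test output has already been parsed into a dictionary of P-values per gene and hypothesis, and the 0.05 threshold used to flag genes that reject every alternative is an illustrative assumption. The P-values shown are invented for the example.

```python
from collections import Counter

def tally_top_hypotheses(au_pvalues, alpha=0.05):
    """Count, per hypothesis, the genes that rank it first by AU P-value."""
    first = Counter()     # genes for which the hypothesis has the highest P-value
    decisive = Counter()  # subset of those genes that reject all alternatives
    for gene, pvals in au_pvalues.items():
        ranked = sorted(pvals, key=pvals.get, reverse=True)
        best = ranked[0]
        first[best] += 1
        if all(pvals[h] < alpha for h in ranked[1:]):
            decisive[best] += 1
    return first, decisive

# Hypothetical P-values for three genes and three competing topologies.
toy_pvalues = {
    "geneA": {"H1": 0.92, "H2": 0.03, "H3": 0.01},
    "geneB": {"H1": 0.60, "H2": 0.45, "H3": 0.02},
    "geneC": {"H1": 0.04, "H2": 0.88, "H3": 0.01},
}
first, decisive = tally_top_hypotheses(toy_pvalues)
```

In this toy case H1 ranks first for two genes, but only one of them (geneA) rejects both alternatives, which separates genes with decisive signal from genes that merely favour a topology.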
Data sets analysed using GGI
In addition to the newly generated genomic set for otophysans (a five-taxon problem involving 15 possible topologies; Figs 1 and 2), four published data sets addressing controversial phylogenetic questions were analysed using GGI (Supplementary Table 3 and Database 13).
Metazoa (protein). A four-taxon problem involving three possible topologies (Fig. 2c). The metazoan data set analysed was compiled by Whelan et al. 40 and consists of 76 taxa and 209 genes. These authors assembled 25 alternative data sets, and this study examined their data set 12, which applies the most stringent filter for selection of orthologous loci (that is, ‘certain’ and ‘uncertain’ paralogues excluded 40 ). It also comprises the broadest taxonomic sampling including distant outgroups such as fungi. Some studies assessing early metazoan relationships exclude distant outgroups to avoid potential artefacts caused by long-branch attraction 74 . However, this is not a concern in this study as GGI constrains the ingroup (animals) to be monophyletic.
Neoavian birds (DNA). A six-taxon problem involving 105 possible topologies, of which only eight competing hypotheses were assessed (Fig. 2d). This study examined the data set of Prum et al. 43 , which includes 259 loci (consisting of exons and flanking introns) sequenced for 198 bird species. To reduce computational burden, this data set was subsampled to include a subclade in the Neoavian radiation where the mousebird is placed. The lineages sampled comprise Accipitrimorphae (7 species), Australaves (55 species), Coraciimorphae (23 species), owls (2 species), mousebirds (2 species) and one outgroup of the family Opisthocomidae.
Eutherian mammals (DNA). A four-taxon problem involving three possible topologies (Fig. 2e). The data set examined was originally assembled by Song et al. 75 , consisting of 447 genes and 37 mammalian species. We used a recent correction of this data set 46 , which relabeled two taxa inadvertently mislabeled in the original data set. We also excluded eight duplicate loci and 26 loci with misaligned sequences, following Springer et al. 20 . The data set examined consists of 413 genes.
Yeast (protein). A four-taxon problem involving three possible topologies (Fig. 2f). The yeast data set consists of 23 species and 1,070 exons assembled by Salichos and Rokas 4 , with loci selected based on synteny and orthology information obtained from two genomic databases for yeasts.
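The topology counts quoted for these problems (3, 15 and 105 for four, five and six taxa) match the number of unrooted, fully resolved trees on n labelled tips, (2n − 5)!!. The short sketch below reproduces those numbers; reading the counts as unrooted topologies is an interpretation, and the function name is hypothetical.

```python
def unrooted_topologies(n_taxa):
    """Number of unrooted binary tree topologies on n labelled tips: (2n - 5)!!."""
    if n_taxa < 3:
        raise ValueError("at least three taxa are required")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count

counts = {n: unrooted_topologies(n) for n in (4, 5, 6)}
```

The steep growth of this count is why only a subset of eight hypotheses was assessed for the six-taxon bird problem.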
Data that support the findings of this study have been deposited in Zenodo (http://dx.doi.org/10.5281/zenodo.51603).
How to cite this article: Arcila, D. et al. Genome-wide interrogation advances resolution of recalcitrant groups in the tree of life. Nat. Ecol. Evol. 1, 0020 (2017).
We dedicate this contribution in honour and memory of our friend and valued colleague Richard Vari whose untimely death has left a huge lacuna in the world of otophysan systematics. We thank D. Maddison for helping with the MDS analyses in Mesquite and R. Rivero for helping with illustrations. We also thank S. Edwards and T. Warnow for providing extensive comments on earlier versions of the paper. J. P. Sullivan kindly provided a photograph for Citharinoidei. This work was supported by National Science Foundation (NSF) grants (DEB-147184, DEB-1541491) to R.B.R., (DEB-1457426 and DEB-1541554) to G.O., (DEB-0315963 and DEB-1023403) to J.W.A., and (DEB-1350474) to L.J.R. This project was also funded by the Opportunity Research Program between George Washington University and the Natural History Museum (Smithsonian) to G.O. and R.V. and the Smithsonian Peter Buck fellowship to R.B.R.
Supplementary Methods, Supplementary Notes, Supplementary Figures 1–7 and Supplementary Tables 1–8.