Introduction

The actinopterygians (ray-finned fish) comprise approximately 28,000 extant species. This group is one of the major vertebrate groups, including nearly half of all extant vertebrate species1. Currently, according to molecular, morphological and paleontological studies, the actinopterygians, including 44 orders and 453 families1, are interpreted as a taxon comprising four major groups: cladistians, chondrosteans, holosteans and teleosteans2,3,4. Considerable effort has long been devoted to resolving the phylogeny of actinopterygians using both morphological and molecular data. However, the phylogenetic relationships among the major groups of actinopterygians remain controversial and unresolved, as do many of the proposed higher-level taxa within the Teleostei (e.g.,5,6). Debates on the ordinal relationships among basal euteleosts and on the most species-rich lineage, the Acanthomorpha, have long continued, although several new findings in molecular biology agree with results derived from morphological studies7,8,9. One of the major questions in actinopterygian phylogeny is the pattern of phylogenetic relationships among the higher “perch-like” fish, the order Perciformes and relatives (e.g.,6,10,11). The monophyly of certain orders and families is in doubt, which compounds the problem1.

Previous studies of actinopterygian phylogenies on the basis of nuclear genes focused primarily on particular groups and/or were usually based on relatively few markers. Even within the same species group, different gene markers have produced conflicting phylogenies in certain cases. For example, Mason-Gamer and Kellogg found that gene trees of the grass tribe Triticeae resulting from four different single-gene data sets disagreed extensively in their intergeneric relationships12. Another study using four nuclear and two mitochondrial loci individually obtained different phylogenies among 17 Oriental Drosophila melanogaster species13. Rokas et al. selected 106 widely distributed orthologous genes from eight yeast genome sequences and concluded that a single gene or a small number of concatenated genes had a significant probability of supporting conflicting topologies, whereas more than 20 genes combined might yield a single, fully resolved species tree with maximum support14. Nevertheless, increasing the number of genes for accurate phylogenetic inference inevitably constrains the number of analysed taxa and increases the percentage of missing data because of practical limitations such as time and resources. Furthermore, based on the aforementioned datasets used by Rokas et al., Phillips et al. obtained 100% supported but mutually incongruent trees using different tree-reconstruction methods and suggested that this inconsistency resulted from a compositional bias15. Despite these caveats, phylogenomic approaches in systematics based on the analysis of multi-gene sequence data are becoming increasingly common because large numbers of characters and independent evidence from many genetic loci often result in well-resolved and highly supported phylogenetic hypotheses14,15,16.
Furthermore, recent simulation and empirical studies have suggested that increases in gene sampling result in better performance than increases in taxon sampling17,18,19 and that phylogenetic reconstruction appears not to be sensitive to highly incomplete taxa as long as a sufficient number of characters are available20,21,22,23. Another advantage of phylogenomics is that the increasing throughput capacity of DNA sequencing technology has made available an ever-growing amount of sequence information, primarily in the form of large collections of expressed sequence tags (ESTs) or genome sequences. Phylogenetic inferences using a multi-locus approach, especially those based on ESTs, are widely used because ESTs can produce large numbers of gene sequences relatively easily and economically and can yield reliable and robust results24,25,26,27,28. Recently, Hittinger et al. sequenced the transcriptomes of 10 mosquito species using second-generation sequencing technologies and obtained robust phylogenetic inferences. They claimed this approach was an efficient, data-rich and economical option for generating large numbers of orthologous gene alignments for multi-locus phylogeny inference29. In view of these results, it is possible that the phylogeny of actinopterygians can be robustly resolved by multi-gene approaches using multi-origin expression data.

Actinopterygians are the vertebrate group with the second-best characterised genomes. Five fully sequenced and high-quality genomes are available for actinopterygians: Danio rerio (zebrafish), Gasterosteus aculeatus (three-spined stickleback), Oryzias latipes (Japanese medaka), Takifugu rubripes (Japanese pufferfish) and Tetraodon nigroviridis (green spotted pufferfish). Additionally, many EST sequencing projects for a wide variety of teleost species have been conducted worldwide and hundreds of thousands of EST sequences are available. However, current deep phylogenetic studies of actinopterygians are primarily based on mitochondrial genomic data. Studies of this type based on nuclear genes are rare, especially in association with large-scale expression data. In the present study, the transcriptomes of three basal actinopterygians (Lepisosteus osseus, Polyodon spathula and Polypterus delhezi) and two cypriniforms (Hypophthalmichthys molitrix and Hypophthalmichthys nobilis) were sequenced using second-generation sequencing technologies (see Materials and Methods). Based on expression data generated in this study and on the results of previous genome and EST sequencing projects, we obtained multi-locus orthologous gene alignments for 17 of 44 orders within the class Actinopterygii. Subsequent analyses were performed to resolve the relationships among these species on the basis of these alignments.

Results

Sequence analyses and alignment

The transcriptome sequences used in this analysis for three basal actinopterygians and two cypriniforms were generated de novo in this study (additional information in supplemental table S1). Transcriptome sequences, ESTs, mRNAs, Unigenes or cDNAs for 21 other species were downloaded from public databases (see Methods). Based on these multi-origin expression data, we obtained 274 orthologue groups (OGs) using OrthoSelect. The data profile for each species used in this study is shown in Table 1. Information for each OG (the number of species, length of alignment, percentage of missing data, best-fitting models of protein sequence evolution and accession number for each sequence) is given in supplemental table S2. The alignment files generated for phylogenetic analyses are given in supplemental file S1. The distribution of the alignment lengths of the 274 OGs is shown in Figure 1. The modal value of the alignment lengths appears to be in the range of 200–800 bp, with more than 90% shorter than 900 bp. Only 6 OGs had alignment lengths longer than 1000 bp and the mean length of all orthologues was 496 bp. There was a bias against obtaining longer alignments (the majority of the alignment lengths were approximately 500 bp). The reason for this outcome may be that most of our sequences were obtained directly from expression data rather than complete sequencing. The proportions of missing data for our OGs ranged from 10.0% for OG2806 to 62.7% for OG1174. The total number of OGs and percentages of missing data for each species are shown in Table 1. The missing data within these species ranged from 4.42% (Danio rerio) to 84.86% (Lepisosteus osseus). The nucleotide supermatrix concatenated from these 274 OGs comprised 135,969 bp, of which 38.9% of the nucleotide positions were missing. The average nucleotide composition of the concatenated supermatrix sequences was A = 27.1%, C = 24.6%, G = 27.0% and T = 21.3%.
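The missing-data percentage and nucleotide composition reported above are simple summary statistics of the supermatrix. The following Python sketch shows one way such values could be computed. It is an illustration under stated assumptions, not the procedure used in this study: the set of symbols treated as missing, the function names and the three-taxon toy alignment are all hypothetical.

```python
from collections import Counter

# Assumption: gaps and ambiguity symbols are all counted as missing data.
MISSING = set("-?NnXx")


def missing_fraction(seqs):
    """Fraction of all cells in the alignment that hold a missing symbol."""
    total = sum(len(s) for s in seqs.values())
    missing = sum(1 for s in seqs.values() for c in s if c in MISSING)
    return missing / total


def base_composition(seqs):
    """Relative frequency of A, C, G and T among the unambiguous cells."""
    counts = Counter(
        c for s in seqs.values() for c in s.upper() if c in "ACGT"
    )
    n = sum(counts.values())
    return {base: counts[base] / n for base in "ACGT"}


# Hypothetical toy alignment (not data from this study).
toy = {
    "Danio_rerio": "ATGGCCAAGT",
    "Lepisosteus_osseus": "ATG----AGT",
    "Polypterus_delhezi": "ATGGCC????",
}

print(f"missing: {missing_fraction(toy):.1%}")
print({b: round(f, 3) for b, f in base_composition(toy).items()})
```

Applying `missing_fraction` to a single taxon (a one-entry dictionary) gives the per-species value, and applying it to a single OG alignment gives the per-OG value.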

Table 1 Data profiles for each species used in the study
Figure 1

Distribution of nucleotide alignment lengths of the 274 orthologue groups.

Phylogeny inference based on nuclear multigenes

The concatenated nucleotide dataset (excluding the third codon positions) and its conceptually translated amino acid dataset were subjected to both Maximum Likelihood (ML, partitioned and unpartitioned) and Bayesian inference (BI, only unpartitioned) analyses and produced a consistent topology with similar phylogenetic support values. Almost all nodes were fully supported by posterior probabilities for BI. For ML, the node placing the two perciforms, Dicentrarchus labrax (European seabass) and Sparus aurata (gilthead seabream), as sister groups was not highly supported by the bootstrap values (Figure 2 and supplemental Figure S1 A–E). Both the AIC (Akaike information criterion) and the AICc values30 showed that the likelihood value with the partitioned supermatrix was better than the value with the unpartitioned supermatrix for the nucleotides. For the protein sequences, however, the likelihood value with the unpartitioned supermatrix was better than the value with the partitioned supermatrix. Interestingly, we reconstructed almost the same topology from the concatenated nucleotide supermatrix including the third codon positions (supplemental Figure S1 F and G); the only difference was the placement of Oreochromis niloticus (Nile tilapia). We recovered a monophyletic clade including Gasterosteus aculeatus (three-spined stickleback), Anoplopoma fimbria (sablefish), Sebastes caurinus (copper rockfish), Dissostichus mawsoni (Antarctic cod) and Hippoglossus hippoglossus (Atlantic halibut) with high confidence. Specifically, Gasterosteus aculeatus (Gasterosteiformes) and Anoplopoma fimbria (Scorpaeniformes) formed a sister-group relationship and Sebastes caurinus (Scorpaeniformes) and Dissostichus mawsoni (Perciformes) formed another monophyletic group, with Hippoglossus hippoglossus (Pleuronectiformes) branching basal to this clade. The order Tetraodontiformes was placed as the most primitive taxon within Percomorpha (except Oreochromis niloticus). 
Figure 2 also shows that Fundulus heteroclitus and Oryzias latipes are sister, with Oreochromis niloticus branched basal to this clade. The monophyly and placement of major taxa such as Teleostei (Elopomorpha + Ostarioclupeomorpha or Otocephala + Euteleostei), Ostarioclupeomorpha (represented by Siluriformes + Cypriniformes), Acanthomorpha (Acanthopterygii (Atherinomorpha + Percomorpha) + Paracanthopterygii), which have been accepted extensively, were supported strongly by our analysis. The clade Protacanthopterygii ((Esociformes + Salmoniformes) + Osmeriformes) was recovered as monophyletic, with the Esociformes and the Salmoniformes as sister groups. As for the major actinopterygian clades, our results supported the topology (Polypteriformes, (Acipenseriformes, (Lepisosteiformes + Teleostei))).
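The comparison of partitioned and unpartitioned analyses by AIC and AICc mentioned above can be illustrated with a short Python sketch using the standard definitions AIC = 2k − 2 ln L and AICc = AIC + 2k(k + 1)/(n − k − 1), where k is the number of free parameters and n the sample size. The log-likelihoods, parameter counts and sample size below are hypothetical values chosen only for illustration, and taking n as the alignment length is an assumption; none of these numbers are results of this study.

```python
def aic(log_likelihood, k):
    """Akaike information criterion; lower values indicate a better fit."""
    return 2 * k - 2 * log_likelihood


def aicc(log_likelihood, k, n):
    """Small-sample corrected AIC for k parameters and sample size n."""
    if n - k - 1 <= 0:
        raise ValueError("AICc is undefined when n <= k + 1")
    return aic(log_likelihood, k) + (2 * k * (k + 1)) / (n - k - 1)


# Hypothetical values: a partitioned model fits better (higher lnL)
# but spends more parameters than an unpartitioned one.
n_sites = 5000
unpartitioned = {"lnL": -1000.0, "k": 10}
partitioned = {"lnL": -950.0, "k": 40}

for name, m in (("unpartitioned", unpartitioned), ("partitioned", partitioned)):
    print(name, aic(m["lnL"], m["k"]), round(aicc(m["lnL"], m["k"], n_sites), 3))
```

In this toy case the partitioned scheme has the lower AIC and AICc, so its gain in likelihood outweighs the penalty for the extra parameters; the opposite outcome, as reported here for the protein data, arises when the likelihood gain is too small to offset that penalty.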

Figure 2

The best-scoring maximum-likelihood (ML) tree derived from the concatenated supermatrix of the 274 nuclear genes (90,646 bp, excluding the third codon positions) from the 26 actinopterygians with the GTRGAMMA model implemented in RAxML.

Numbers beside internal branches indicate bootstrap values based on 100 replicates. The other tree-reconstruction strategies used in this report all recovered the same topology and are shown in supplemental Figure S1.

Discussion

The extant basal actinopterygians include four major lineages, the Polypteriformes, Acipenseriformes, Lepisosteiformes and Amiiformes. Although their basal positions within the actinopterygians have been consistently accepted by previous investigators1, considerable controversy over their relationship to the teleosts continues2,4,8. We compared the phylogenetic positions of three lineages of basal actinopterygians (Polypteriformes, Acipenseriformes and Lepisosteiformes) relative to the teleosts with earlier hypotheses (see Arratia (2001)31, who presented all possible morphological and molecular hypotheses, and also Arratia (2004)32). Our topology was in accordance with a previous conclusion based on gill-arch structure33 and with the first published significant hypothesis on the basal actinopterygian relationships based on molecular data34. Many recent conclusions based on morphological and molecular data were also consistent with our topology35,36,37. In contrast, previous findings that acipenseriforms or lepisosteiforms are more closely related to teleosts based on mitogenomic data8 or molecular synapomorphies38 were weakly supported by our topological test (Table 2). Currently, the polypteriforms (e.g., armored bichir) are widely accepted as the sister group of all other extant actinopterygians1. However, because the results we presented here did not include Amia calva in the analysis, this conclusion may be subject to bias and may require further investigation.

Table 2 Results from AU tests and SH tests among alternative tree topologies derived from analysis of nucleotide supermatrix of 274 OGs

In addition to the basal actinopterygians, all other fishes in this study are collectively included within the Teleostei (Figure 2), which was represented by three main groups here: Elopomorpha, Ostarioclupeomorpha (= Otocephala) and Euteleostei. Generally, researchers agree that the protacanthopterygians occupy a phylogenetic position intermediate between the basal teleosts (ostarioclupeomorphs and below) and neoteleosts (stomiiforms and above)9 and are interpreted as basal Euteleostei. Because many of the morphological characters of the group have a mosaic distribution, the composition of this assemblage has undergone numerous changes over the past decades1. Additionally, the deep relationships of the protacanthopterygians are so complex and controversial1,9 that at least 10 different phylogenetic hypotheses have been proposed (Figure 3 A–J; note that argentinoids are not shown because they are absent from our analysis. For more information, see Ishiguro's figure 1 A–J9, Springer & Johnson's figure 339 and Diogo's figure 240). Topological tests strongly supported our placement of the protacanthopterygians and related lineages and confidently rejected the other dichotomous hypotheses (Table 2). Among these hypotheses, the phylogenetic position of the esociforms is one of the most controversial9,41. Our analysis strongly supports the hypothesis that the sister taxon of the esociforms is the salmoniforms rather than Neoteleostei39,42 or Osmeriformes40. This sister-group relationship is in accordance with many morphology-based and nearly all molecular-based hypotheses. Ramsden et al. corroborated this sister-group relationship from other perspectives, such as the life history and distribution of the fishes43. However, the placements of other lineages in these hypotheses are different from ours. For instance, the placement of Neoteleostei in our hypothesis differs clearly from the placement in earlier hypotheses except for that of Rosen44. 
Based on his morphological studies, Rosen suggested that protacanthopterygians were a monophyletic unit and that Protacanthopterygii and Neoteleostei formed a sister group (Fig 3A). This hypothesis is the same as ours. However, his placement of ostariophysans as a sister group to Protacanthopterygii and Neoteleostei was different from ours. Recently, several hypotheses based on mitochondrial data obtained the same topology as that found by our study. In fact, in the study of Ishiguro et al., the monophyly of protacanthopterygians cannot be rejected based on mitogenomic data if alepocephaloids are excluded and monophyly is enforced for the remaining groups of protacanthopterygians9. Before them, almost all morphology-based analyses consistently treated alepocephaloids and argentinoids, two suborders of the order Argentiniformes, as sister groups. However, Ishiguro et al.'s mitogenomic phylogenetic analysis argued that alepocephaloids were nested within the otocephalans with high statistical support9. Therefore, the phylogenetic position of these two lineages required further investigation.

Figure 3

Ten alternative phylogenetic hypotheses for basal euteleosts published after Rosen (1974).

A-H were modified from Ishiguro et al. (2003), I was modified from Diogo (2008) and J was modified from Fu (2010) and Broughton (2010). All terminal taxa were standardised to the three major protacanthopterygian lineages analysed in the present study (indicated by bold face).

Many taxa within the Euteleostei (minus Protacanthopterygii) that have true spines in the dorsal, anal and pelvic fins are included within the Acanthomorpha1. The superorder Acanthopterygii, which contains 13 orders, 267 families, 2,422 genera and approximately 15,000 species, can be divided into three large assemblages (termed Series, i.e., Mugilomorpha, Atherinomorpha and Percomorpha) and is the most species-rich superorder within this taxon1,45. Although many morphological and molecular studies have been conducted, the relationships among major lineages within the Acanthomorpha remain poorly defined1,6,7,10,11,45,46,47. In addition, certain orders and families within this assemblage are not monophyletic, which complicates the situation further1. In this study, we intended to test the possibility of recovering their relationships using many genes rather than to resolve them thoroughly. The monophyly of the series Atherinomorpha, containing the Atheriniformes, Beloniformes (including the Adrianichthyoidei) and Cyprinodontiformes, has been consistently suggested1,48. Similarly, Japanese medaka (Beloniformes) and killifish (Cyprinodontiformes) were grouped as sister groups with high confidence in this study. Moreover, we found that one scorpaeniform fish was more closely related to the Antarctic cod (Perciformes), whereas the other scorpaeniform was the sister group of the three-spined stickleback (Gasterosteiformes). Certain species within Perciformes appeared more closely related to the orders Pleuronectiformes, Scorpaeniformes and Gasterosteiformes, but another species (Oreochromis niloticus) was more closely related to Atherinomorpha. This result is consistent with previous studies that proposed that Scorpaeniformes and Perciformes may not be monophyletic1,45,49. Interestingly, in a previous study based on mitogenomic sequences, Miya et al. found that internal branches among Percomorpha were only weakly supported but that members of Gasterosteiformes and Scorpaeniformes formed a strongly supported monophyletic group with a bootstrap value of 100%46. Moreover, the affinity of the cichlids with members of the Atherinomorpha has been consistently supported by studies based on nuclear genes17,50,51,52 and mitochondrial genomes35,37,48,53. This phylogenetic affinity is also supported by a unique egg morphology and spawning mode48. We recovered the tetraodontiforms as pre-perciforms with high confidence (Fig. 2). This result was in accordance with Springer and Johnson's finding, which was based on morphological studies39. However, evidence suggests that Scorpaeniformes (including the Dactylopteridae), Pleuronectiformes and Tetraodontiformes were most likely derivatives of perciform lineages1. Accordingly, our placement of the tetraodontiforms may be an artifact resulting from sparse taxonomic sampling of those species. Our multi-gene analysis recovered the relationships among most of these lineages. Nevertheless, many questions regarding the relationships among lineages within Acanthomorpha remain unanswered. For example, the monophyly of the Paracanthopterygii, the sister group of Atherinomorpha and Tetraodontiformes, the phylogenetic placement of Batrachoidiformes and the relationships among lineages within Percomorpha have long been controversial1. The last-named question poses particular difficulties because the monophyly of these groups is questionable and phylogenetic conclusions will depend on the choice of representatives50.

The deep phylogeny of actinopterygians is a long-standing and complex problem in the study of fish evolution. In this study, our taxon sampling for basal actinopterygians was purposefully chosen, but the information used for teleosts was based primarily on expression data available in public databases. We showed that phylogenomics based on integrating multi-origin expression data can recover their phylogeny with high confidence and that the major topology we obtained is consistent with that found by most previous studies. Moreover, missing data are a significant problem for large-scale phylogenomic analysis. Philippe et al. showed that a supermatrix alignment with 25% missing data can still confidently resolve the phylogeny of eukaryotes21. In the case of actinopterygian phylogeny, an alignment with 38.9% missing data still yielded a topology with high support that agrees with most previous studies. These results suggest that even with insufficient taxon sampling and several data gaps, large-scale phylogenomics based on integrating multi-origin expression data can produce a relatively good resolution of the deep phylogeny of actinopterygians. Further investigations based on more purposefully chosen species may completely reconstruct the relationships of actinopterygians and provide a reliable phylogenetic framework for studying actinopterygian evolution.

Methods

Data collection and processing

Transcriptome sequences of five ray-finned fish species, Hypophthalmichthys molitrix (silver carp), Hypophthalmichthys nobilis (bighead carp), Lepisosteus osseus (longnose gar), Polyodon spathula (spoonbill cat) and the outgroup, Polypterus delhezi (armored bichir), were generated by Solexa sequencing in this study. Specimens of these species were purchased from a commercial source. The total RNA of each species was extracted from pooled organs with Trizol (Invitrogen, Carlsbad, CA, USA) according to the manufacturer's instructions. Poly (A+) RNA isolation, cDNA synthesis, preparation, sequencing (on an Illumina Genome Analyzer) and assembly (using the SOAP software package54) were performed at Beijing Genomics Institute. The assembled transcriptome sequences of European eel (Anguilla anguilla) were downloaded from EeelBase (http://compgen.bio.unipd.it/eeelbase/).

ESTs and/or mRNAs of Anoplopoma fimbria (sablefish), Dicentrarchus labrax (European seabass), Dissostichus mawsoni (Antarctic cod), Esox lucius (Northern pike), Hippoglossus hippoglossus (Atlantic halibut), Osmerus mordax (rainbow smelt), Sebastes caurinus (copper rockfish) and Sparus aurata (gilthead seabream), were downloaded from the National Center for Biotechnology Information (www.ncbi.nlm.nih.gov, GenBank status on 23 Dec 2009). Unigenes for Fundulus heteroclitus (killifish), Gadus morhua (Atlantic cod), Ictalurus furcatus (blue catfish), Ictalurus punctatus (channel catfish), Oreochromis niloticus (Nile tilapia), Pimephales promelas (fathead minnow) and Salmo salar (Atlantic salmon) were also downloaded from this database (GenBank status on 23 Dec 2009). Various contaminants and low-quality and low-complexity sequences within these data were screened and trimmed using SeqClean (http://compbio.dfci.harvard.edu/tgi/software) with NCBI's UniVec as a screening file.

Complementary DNA sequences of five model fish species, Danio rerio (zebrafish), Gasterosteus aculeatus (three-spined stickleback), Oryzias latipes (Japanese medaka), Takifugu rubripes (Japanese pufferfish) and Tetraodon nigroviridis (green spotted puffer), were retrieved from Ensembl (http://www.ensembl.org/, RELEASE62).

Sequence selection and alignment

Orthologue assignments were made using a slightly modified version of the OrthoSelect method55. The default reference databases of OrthoSelect are KOG (clusters of euKaryotic Orthologous Groups) and OrthoMCL, which include non-fish species. Teleosts have experienced the fish-specific genome duplication, which may result in “one2two” or “one2many” orthology relationships between teleosts and other species. To overcome this problem and to identify the orthology relationships unambiguously, it is preferable to use “one2one” orthology relationships as references. Therefore, we downloaded amino acid sequences of five model fish and their “one2one” relationships from Ensembl using BioMart. Each of these “one2one” sequence sets was termed an orthologue group (OG) in this study and the expression data were assigned to these OGs by a BLASTX analysis of individual EST sequences against all OG proteins. After the OG assignment, each sequence was translated using ESTScan56, GeneWise57 and a standard six-frame translation using BioPerl and aligned to the best hit from the previous BLAST search using bl2seq58. The translated sequence with the lowest E-value was chosen as the correctly translated sequence. Subsequently, one sequence from each organism was selected as the most probable orthologue, in accordance with the OrthoSelect strategy based on matching positions normalized by sequence length in pairwise comparisons with MUSCLE59. However, because many ESTs were of low quality and included frameshift errors or premature stop codons, and because of the limitations of bl2seq, the true orthologue may be discarded in some species. To overcome these problems, we translated the expression data into protein sequences using ESTScan and found the best sequence from each database using hmmbuild and hmmsearch from the HMMER package60. After HMM selection, we obtained the orthology relationships for each OG. 
Then, we chose a model fish sequence, translated it into a protein sequence and compared it to each of its orthologues separately with GeneWise (only orthologues with a score greater than 100 were retained). A customized Perl script was then used to extract matched nucleotides and to generate a sequence alignment for each OG. If a sequence was assigned to more than one OG, we discarded all these OGs to avoid any ambiguity. The OG alignments having more than 14 sequences were visually inspected and adjusted by hand using BioEdit (http://www.mbio.ncsu.edu/BioEdit/bioedit.html). Finally, 274 OGs were selected and used for subsequent analyses.
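Two of the steps described above, discarding every OG that shares a sequence with another OG and joining the retained OG alignments into a supermatrix, can be sketched in Python as follows. This is a minimal illustration and not the customized Perl script used in this study: the function names, the data layout (dictionaries keyed by OG and taxon), the use of “?” for absent taxa and the toy identifiers are all assumptions.

```python
from collections import defaultdict


def drop_ambiguous_ogs(assignments):
    """Remove every OG that shares at least one sequence with another OG.

    assignments maps an OG identifier to the set of sequence identifiers
    assigned to it.
    """
    ogs_per_sequence = defaultdict(set)
    for og, seq_ids in assignments.items():
        for seq_id in seq_ids:
            ogs_per_sequence[seq_id].add(og)
    ambiguous = set()
    for ogs in ogs_per_sequence.values():
        if len(ogs) > 1:
            ambiguous |= ogs
    return {og: ids for og, ids in assignments.items() if og not in ambiguous}


def concatenate(alignments, taxa, missing="?"):
    """Join per-OG alignments into one supermatrix.

    alignments maps OG -> {taxon: aligned sequence}. Taxa absent from an OG
    are padded with the missing symbol. Returns the supermatrix and a list
    of (OG, start, end) ranges in 1-based inclusive coordinates.
    """
    blocks = {taxon: [] for taxon in taxa}
    partitions = []
    start = 1
    for og in sorted(alignments):
        aln = alignments[og]
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"{og}: sequences differ in aligned length")
        length = lengths.pop()
        for taxon in taxa:
            blocks[taxon].append(aln.get(taxon, missing * length))
        partitions.append((og, start, start + length - 1))
        start += length
    return {t: "".join(parts) for t, parts in blocks.items()}, partitions


# Hypothetical toy input (identifiers are not from this study).
kept = drop_ambiguous_ogs(
    {"OG1": {"a1", "b1"}, "OG2": {"a2", "b1"}, "OG3": {"a3"}}
)
matrix, parts = concatenate(
    {"OG1": {"A": "ATG", "B": "ATA"}, "OG2": {"A": "CC"}}, taxa=["A", "B"]
)
print(sorted(kept), matrix, parts)
```

The partition ranges returned by `concatenate` record where each gene sits in the supermatrix, which is the information a partitioned likelihood analysis needs in order to assign separate parameters to each gene.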

Phylogenetic analysis

The nucleotide alignments (excluding the third codon positions) and the conceptually translated amino acid alignments of these OGs were concatenated separately. Both supermatrices were subjected to subsequent Bayesian inference (BI) and Maximum Likelihood (ML) analyses. BI was performed with the MPI version of MrBayes 3.1.261, in which Markov Chain Monte Carlo (MCMC) calculations were spread across multiple CPUs and run on parallel computing architectures. The analysis was initiated from a random starting tree. Two runs with twelve chains of MCMC iterations were performed for 5 million generations (sampling trees every 100 generations) with the GTR + I + Γ model of sequence evolution (for MrBayes and protein sequences, we used mixed + I + Γ) and the first 20,000 trees (2 million generations) were discarded as burn-in. The average standard deviation of the split frequencies of the MCMC runs was used as the convergence diagnostic. The 50% majority-rule consensus tree was determined to calculate the posterior probabilities for each node. A parallel version of RAxML 7.2.662 was used for constructing Maximum Likelihood (ML) trees with the GTRGAMMA model for both the partitioned and the unpartitioned supermatrices (for the unpartitioned protein supermatrix, we used the PROTGAMMAJTTF model; the best-fitting models of protein sequence evolution for each OG are listed in supplemental table S2). The partitioned supermatrices allow RAxML to assign different parameters to each gene. One hundred replicates for rapid bootstrap analyses62 were also performed with RAxML and a 50% majority-rule consensus was calculated to determine the support values for each node. Finally, we placed the root at the branch quarter of Polypterus using MEGA563. The best-fitting models of protein sequence evolution were selected by ProtTest2.464. Tests of alternative phylogenetic hypotheses were implemented in CONSEL65.
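Two numerical details of the analysis above can be checked with a short Python sketch: removing the third codon positions from an in-frame alignment, and the share of sampled trees discarded as burn-in. The sketch is illustrative only. The function names are hypothetical, the codon filter assumes that every alignment starts at codon position 1 with no frame-shifting columns, and the burn-in calculation assumes exactly one sampled tree per sampling interval.

```python
def drop_third_codon_positions(seq):
    """Keep codon positions 1 and 2 of an in-frame nucleotide sequence."""
    # With 0-based indexing, the third position of each codon has i % 3 == 2.
    return "".join(base for i, base in enumerate(seq) if i % 3 != 2)


def burnin_fraction(generations, sample_every, discarded_trees):
    """Fraction of the sampled trees that is discarded as burn-in."""
    total_trees = generations // sample_every
    return discarded_trees / total_trees


# Toy sequence of three codons: ATG GCC AAG.
print(drop_third_codon_positions("ATGGCCAAG"))

# Lengths reported in this article: 135,969 bp in total and
# 90,646 bp once the third codon positions are excluded.
print(len(drop_third_codon_positions("A" * 135_969)))

# Settings reported above: 5 million generations, one tree every
# 100 generations, first 20,000 trees discarded.
print(burnin_fraction(5_000_000, 100, 20_000))
```

Under these assumptions, the filter reproduces the 90,646 bp length given in the Figure 2 legend, and the burn-in corresponds to 40% of the sampled trees in each run.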