Introduction

Whole-genome duplication (WGD) events provide organisms with additional copies of their entire genomes, and have long been postulated as the major source of evolutionary innovations that contribute to the diversification of anatomical structures and rapid speciation seen in some lineages (Ohno, 1970; Holland, 2003; Sémon and Wolfe, 2007; Jaillon et al., 2009; Van de Peer et al., 2009). WGD events are distributed unequally across eukaryote phylogeny. They are common in plants (Adams and Wendel, 2005) and several have been reported in fungi (Albertin and Marullo, 2012), but the repeated rounds of WGD seen in the Vertebrata are the most prominent examples yet reported in metazoans, as can be seen in Figure 1. Outside the Vertebrata, only the fast-evolving parthenogenetic bdelloid rotifers (Flot et al., 2013) have been shown to exhibit WGD in the metazoan clade.

Figure 1

Whole-genome duplication events across metazoan phylogeny. Xiphosuran, chelicerate and metazoan consensus phylogeny, with arthropod relationships based on Regier et al. (2010) and wider metazoan relationships based on Dunn et al. (2008). Note that some inter-relationships, notably in the Chelicerata, are controversial. Specific clades noted in the text, such as the Chelicerata, Protostomia and Deuterostomia, are indicated by differential shading. Known WGD events are marked with a lightning bolt. Note multiple rounds of WGD have occurred in all lineages thus marked.

The presence of multiple paralogous copies of different developmental gene families, especially the homeobox Hox cluster genes, which are arranged in large syntenic genomic regions, has been essential for the discovery and study of WGDs in vertebrates (Amores et al., 1998). Homeobox genes usually contain a 180-nucleotide long motif that encodes a DNA-binding domain, the homeodomain. Most homeodomain-containing proteins act as transcription factors, with roles in a wide range of developmental processes. High degrees of homeobox gene conservation, and in some cases functional conservation, are found across different phyla of fungi, plants and animals (Gehring, 1984, 1994; De Robertis, 1994; Holland and Hogan, 1988; Galliot et al., 1999; Bürglin, 2005, 2011; Ryan et al., 2006; Holland et al., 2007, 2012; Larroux et al., 2008; Zhong et al., 2008; Zhong and Holland, 2011a), with the encoded homeodomain generally being the most highly conserved region (Bürglin, 2005, 2011). Owing to their roles as gene regulators and their highly conserved sequences, homeobox genes have become ideal candidates to follow large-scale changes to genomes in evolution (Holland, 1999, 2012).

The homeobox gene superclass is extremely diverse, with over 200 genes in vertebrate genomes and ~100 genes in many invertebrate genomes. Based on the homeodomain sequence, metazoan homeobox genes can be classified into 11 classes: ANTP (including the Hox genes), CERS (ceramide synthase), CUT, HNF, LIM, PRD, PROS, POU, SINE, TALE (three amino-acid loop extension) and ZF (zinc finger) (Bürglin, 2005, 2011; Holland et al., 2007; Zhong et al., 2008; Holland, 2012). According to phylogenetic analyses and surveys of many genomes, two major classes of homeobox genes, the ANTP and PRD classes, are thought to be confined to the metazoans (Galliot et al., 1999; Bürglin, 2005, 2011; Holland and Takahashi, 2005).

The ANTP-class homeobox genes are named after the Antennapedia gene of Drosophila melanogaster. This family of genes can be recognised by their distinctive homeodomain, and in some cases, also five or six amino acids constituting the hexapeptide (or pentapeptide) motif upstream of the homeodomain (Bürglin, 2005, 2011; Holland et al., 2007; Holland, 2012). Nomenclature within the ANTP-class homeobox genes is complex, and includes Hox, Hox-like (or Hox-linked), Hox/ParaHox related, extended Hox, NK and NK-like or NK-linked genes (Holland et al., 2007; Ferrier, 2008; Hui et al., 2012). Comparing the chromosomal positions of the ANTP-class homeobox genes in human and mouse, Pollard and Holland (2000) first postulated that the whole ANTP-class of homeobox genes originated from a single ancestral gene, via an ancient mega-ANTP-class homeobox cluster (megacluster) that existed deep in metazoan ancestry (for details, see Pollard and Holland, 2000; review by Garcia-Fernàndez, 2005). Mapping of ANTP class homeobox genes in the cephalochordate, Branchiostoma floridae and polychaete Platynereis dumerilii has provided further support for the hypothesis (Castro and Holland, 2003; Luke et al., 2003; Hui et al., 2009, 2012).

The PRD class homeobox genes derive their name from the paired gene of D. melanogaster (Bürglin, 2005, 2011; Holland et al., 2007), and are further subdivided into the PAX and PAXL subclasses found throughout the animal kingdom (Galliot et al., 1999; Ryan et al., 2006; Holland et al., 2007; Hoshiyama et al., 2007; Larroux et al., 2008). Unlike ANTP class genes, PRD genes are usually dispersed; however, two PRD-homeobox clusters have been proposed—the mammalian Rhox homeobox gene cluster (MacLean et al., 2005) and a conserved cluster containing homeobrain, rx and orthopedia (Mazza et al., 2010).

Recently, however, a number of investigations in the Chelicerata have provided circumstantial evidence for the possibility of one or more WGD events in that clade. These include the discovery of multiple copies of homeobox-containing genes in the spider Cupiennius salei (Schwager et al., 2007), the scorpion Centruroides sculpturatus (Sharma et al., 2014a) and the horseshoe crab Limulus polyphemus (Nossa et al., 2014). The latter paper showed by a variety of methods, including linkage mapping, comparison of age distribution of paralogous genes and gene cluster comparison, that WGD occurred in L. polyphemus. However, whether WGD is limited to L. polyphemus alone, or is more commonly found in the Xiphosura, is uninvestigated, and would have a range of implications for our understanding of the prevalence of WGD in the Chelicerata. The Chelicerata are a diverse clade, found both on land and in aquatic habitats and although many chelicerate lineages are strikingly diverse both in terms of number of species and morphological and ecological adaptations, other groups exhibit a remarkable morphological stasis. The marine chelicerate horseshoe crabs, of the Order Xiphosura, are a prominent example of the latter situation.

With the exception of details in abdominal segment and biramous limb morphology (Briggs et al., 2012), living horseshoe crabs, such as Carcinoscorpius rotundicauda (Figure 2a), are remarkably similar in appearance to fossil species 400 million years in age, and almost indistinguishable from some found in the Jurassic (Briggs et al., 2005). In addition, very limited diversity has been catalogued in the Xiphosura across geological time (Störmer 1952; Störmer 1955; Anderson and Selden, 1997). Present xiphosuran diversity is limited to just four species: L. polyphemus, found in the Atlantic, and C. rotundicauda, Tachypleus gigas and Tachypleus tridentatus, found in Southeast Asia. These species separated from one another relatively recently, ~135 million years before the present day (Figure 2b, Obst et al., 2012).

Figure 2

Xiphosuran biology and homeodomain-containing gene number. (a) Adult Carcinoscorpius rotundicauda, oblique and ventral view. (b) Xiphosuran phylogeny and divergence time after Obst et al. (2012). MYA=millions of years ago. (c) Homeobox genes extracted from sequenced horseshoe crab species with phylogenetic evidence of identity shown in bold text. Genes identified but not included in phylogeny are shown in italics, and can be seen in Supplementary File 1.

Here we analyze the homeobox gene complements in the genomes of the three extant horseshoe crab genera (in the species C. rotundicauda, L. polyphemus and Tachypleus tridentatus) and provide strong evidence for the presence of at least one WGD event in the xiphosuran lineage. The orthologous relationships of the multiple copies of homeobox genes in all horseshoe crab genomes indicate that the duplication predates the last common ancestor of extant xiphosurans. We present reverse transcription (RT)-PCR and RNAseq-based assays providing evidence of the diversification of Hox gene functionality after this event, with sub- and neo-functionalisation along the anterior–posterior axis of these species. This represents the first evidence and study of the consequences of a WGD event in an ecdysozoan lineage. Such evidence has important implications for how we consider genomic evolution and phenotypic diversification after such events, representing a key comparison point for the vertebrate lineage when assessing the causes and consequences of genomic and phenotypic changes in metazoan evolution.

Materials and methods

Genomic and transcriptomic sequencing

Horseshoe crab samples were acquired commercially from local suppliers in Hong Kong (C. rotundicauda and T. tridentatus) and a collaborator (S Smith) in the USA (L. polyphemus). Genomic DNA samples were extracted from muscle tissue using a PureLink Genomic DNA Mini Kit and the manufacturer’s protocol, and sequenced by BGI (Hong Kong) on Illumina HiSeq2000. A variety of assembly software platforms and settings were trialled empirically before selection of Velvet for generation of a final assembly. Fragment size distributions (nominally 170 bp) were found using Bowtie (Langmead and Salzberg, 2012) on a preliminary assembly (k-mer 51, Velvet 1.2.09 (Zerbino and Birney, 2008)) of genomes. These preliminary assemblies were also used to determine expected genome coverage. Using this information, genome assembly was performed using Velvet 1.2.09 (Zerbino and Birney, 2008), with a k-mer size of 51, minimum fragment length of 200 bp, fragment sizes as noted in Supplementary File 1 and expected coverage as noted in text. The raw read data from the three horseshoe crabs examined in this study have been uploaded to NCBI’s short read archive, and are available under Bioproject Accession PRJNA243016, SRP040718, and assemblies can be downloaded from webpage (tinyurl.com/HSCgenomes) and the Dryad repository linked to this article (url: http://dx.doi.org/10.5061/dryad.81fv1).
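As a rough illustration of how an expected-coverage setting relates to sequencing depth, the sketch below applies the k-mer coverage relation given in the Velvet manual, Ck = C × (L − k + 1)/L. Only k = 51 is taken from the text. The 100 bp read length and sixfold base coverage are assumptions chosen for illustration, and the function name is hypothetical; this is not a record of the values actually passed to Velvet.

```python
def kmer_coverage(base_coverage: float, read_length: int, k: int) -> float:
    """Convert base coverage C into k-mer coverage Ck = C * (L - k + 1) / L."""
    if k > read_length:
        raise ValueError("k cannot exceed the read length")
    return base_coverage * (read_length - k + 1) / read_length


# Assumed inputs: 6x base coverage and 100 bp reads (illustrative only);
# k = 51 is the k-mer size stated in the text.
example_ck = kmer_coverage(base_coverage=6.0, read_length=100, k=51)
print(f"expected k-mer coverage: {example_ck:.2f}")
```

Under these assumed inputs the k-mer coverage is half the base coverage, because only 50 of the 100 positions in a read can start a 51-mer.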

mRNA samples for sequencing were extracted from whole chelicerae samples of adult C. rotundicauda and T. tridentatus using Trizol according to the manufacturer’s instructions, followed by RNeasy (Qiagen, Venlo, Netherlands) purification. Samples were sequenced by BGI (Hong Kong) on an Illumina HiSeq2000. Trinity r2013_08_14 (Grabherr et al., 2011) was used to assemble transcriptomes with all default settings.

Gene identification and phylogenetic analysis

Genes were identified in our data set, and in the sequenced genomes of other chelicerates, by tblastn with ncbi-blast-2.2.23+ (Altschul et al., 1990) using homeodomain sequences of known identity downloaded from HomeoDB (Zhong and Holland, 2011b). All contig sequences hit with an E-value <10⁻⁹ were reciprocally blasted (blastx) to the NCBI nr database to putatively confirm identity. Identified contigs with exonic regions of clear homology with known homeodomain genes were converted to amino-acid sequence using the EMBOSS Transeq server (Rice et al., 2000) and standard codon table. Sequences were aligned using MAFFT (Katoh et al., 2002) under the L-INS-i model to known chelicerate sequences downloaded from GenBank and relevant genome resources, along with B. floridae, Homo sapiens and Tribolium castaneum sequences downloaded from HomeoDB (Zhong and Holland, 2011b). Aligned regions had gaps removed using MEGA (Kumar et al., 2008). Phylogenetic analysis was performed using PhyML (Guindon et al., 2009) and MrBayes (Ronquist & Huelsenbeck, 2003) using models as described in figure legends. Bayesian phylogenies were then displayed in FigTree for visualisation (tree.bio.ed.ac.uk/software/figtree/).
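The E-value filtering step described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: it assumes the tblastn results were written in BLAST tabular format (-outfmt 6), in which the eleventh column holds the E-value, and the function name and example rows are invented for demonstration.

```python
from typing import Iterable, List, Tuple

E_VALUE_CUTOFF = 1e-9  # threshold stated in the text


def filter_hits(lines: Iterable[str],
                cutoff: float = E_VALUE_CUTOFF) -> List[Tuple[str, str, float]]:
    """Return (query, contig, evalue) for hits with an E-value below the cutoff.

    Assumes 12-column BLAST tabular output, where column 11 is the E-value.
    """
    kept = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        evalue = float(fields[10])
        if evalue < cutoff:
            kept.append((fields[0], fields[1], evalue))
    return kept


# Made-up rows for illustration only; they are not results from the study.
example_rows = [
    "HoxQuery\tcontig_1\t85.0\t60\t9\t0\t1\t60\t300\t479\t2e-25\t98.0",
    "HoxQuery\tcontig_2\t40.0\t55\t33\t0\t1\t55\t10\t174\t0.003\t35.0",
]
candidates = filter_hits(example_rows)
print([contig for _, contig, _ in candidates])
```

Contigs passing this filter would then go forward to the reciprocal blastx check against nr.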

Expression analysis

Walking legs, book gills and the telson of two adult individuals of each of C. rotundicauda and T. tridentatus were excised and samples taken from the most proximal portion of each appendage to the main body mass. Primers were designed to conserved exonic regions using Primer3Plus for each gene, and aligned to other paralogues to ensure specificity. RNA was extracted using Trizol followed by two rounds of phenol/chloroform extraction and RNA precipitation, and reverse transcription performed using Takara Reverse Polymerase. PCR master mixes were prepared for each gene examined, which were split into 18 μl reaction mixes containing 0.5 μM of primer, 200 μM of each dNTP (Takara Bio Inc., Shiga, Japan), 1 × PCR buffer, 3 mM MgCl2 and 2 units Taq (Roche, Penzberg, Germany), to which 2 μl of 10–30 ng cDNA from each appendage was added to individual tubes. PCR was performed using a 3 min 95 °C initial denaturation, 38 × 95 °C/60 s, 60 °C/60 s, 72 °C/60 s steps and a final 72 °C/600 s elongation. Electrophoresis was performed on a 1% agarose gel and visualized under UV with SYBR Safe (Life Technologies, Carlsbad, CA, USA). All primer sequences and gels can be seen in Supplementary File 2.
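The reaction set-up arithmetic implied by this protocol can be laid out as a short sketch. Volumes, cycle number and step durations are taken from the text; the 10% pipetting overage and the function names are assumptions added for illustration, and thermocycler ramp times are ignored.

```python
# Values taken from the protocol described above.
MIX_UL = 18.0                  # master mix per reaction
CDNA_UL = 2.0                  # cDNA template per reaction
CYCLES = 38
STEP_SECONDS = (60, 60, 60)    # denaturation, annealing, extension
INITIAL_DENATURATION_S = 3 * 60
FINAL_ELONGATION_S = 600


def reaction_volume_ul() -> float:
    """Final volume of one PCR reaction."""
    return MIX_UL + CDNA_UL


def programme_seconds() -> int:
    """Total hold time of the PCR programme, excluding ramping."""
    return INITIAL_DENATURATION_S + CYCLES * sum(STEP_SECONDS) + FINAL_ELONGATION_S


def master_mix_ul(n_reactions: int, overage: float = 0.10) -> float:
    """Master mix to prepare for n reactions; the 10% overage is an assumption."""
    return MIX_UL * n_reactions * (1 + overage)


print(f"reaction volume: {reaction_volume_ul():.0f} ul")
print(f"programme length: {programme_seconds() / 60:.0f} min")
print(f"master mix for 6 reactions: {master_mix_ul(6):.1f} ul")
```

Each reaction therefore has a final volume of 20 μl, and the programme holds for 127 min in total before any ramping time is added.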

Results and Discussion

Genome sequencing and assembly

Genome and transcriptome sequencing/assembly results are summarised in Supplementary File 1. Paired end read quality was high, with lower quartile Phred scores above 30 through to the 100th base. The data from the three horseshoe crabs examined in this study have been uploaded to NCBI’s short read archive, and are available under Bioproject Accession PRJNA243016, SRP040718. The 458 gene family CEGMA data set (Parra et al., 2007) was compared to the horseshoe crab genome assemblies using tblastn (Altschul et al., 1990) with an E-value cutoff of 10⁻⁶. Identifiable homologues of ~98% of the 458 Core Eukaryotic Genes were found in the L. polyphemus, T. tridentatus and C. rotundicauda genomes (448, 450 and 448 Core Eukaryotic Genes, respectively), confirming excellent recovery of the coding areas of these genomes. Five common Core Eukaryotic Genes were absent from all three genomes, and may be particularly divergent or absent in the xiphosuran lineage (Supplementary File 1). The genome sequence depth for all three species is around sixfold, recovering ~1.5 Gbp per species. This is lower than the previously estimated genome sizes of these species of ~2.8 Gbp (Goldberg et al., 1975), which together with the CEGMA data suggests that the assembled genome data sets are biased towards coding sequence to the exclusion of longer repeat sequences, or perhaps that previous calculations were overestimates. A L. polyphemus genome has been noted online (Nossa et al., 2014), but the data presented here represent a large increase in publicly available sequence data for the Xiphosura clade, particularly by including Southeast Asian species.
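The recovery figures quoted above can be checked with a few lines of arithmetic. The counts and sizes are those given in the text; the variable and function names are illustrative only.

```python
CEGMA_TOTAL = 458  # Core Eukaryotic Genes in the CEGMA data set

# Core Eukaryotic Genes recovered per assembly, as reported in the text.
recovered = {
    "L. polyphemus": 448,
    "T. tridentatus": 450,
    "C. rotundicauda": 448,
}


def recovery_percent(found: int, total: int = CEGMA_TOTAL) -> float:
    """Percentage of the reference gene set found in an assembly."""
    return 100.0 * found / total


for species, found in recovered.items():
    print(f"{species}: {recovery_percent(found):.1f}% of core genes recovered")

# Assembled sequence (~1.5 Gbp) relative to the earlier genome size
# estimate (~2.8 Gbp); both values are approximate in the source.
assembled_fraction = 1.5 / 2.8
print(f"assembled fraction of estimated genome: {assembled_fraction:.1%}")
```

The contrast between roughly 98% recovery of core genes and roughly 54% recovery of the estimated genome length is what motivates the suggestion that the assemblies are biased towards coding sequence.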

Multiple copies of Hox genes in all three horseshoe crab genomes

Homeobox genes encode transcription factors deployed in animal development and contain several well-conserved classes such as the ANTP, PRD and SINE genes (Zhong and Holland, 2011b; Holland, 2012; Hui et al., 2012). Genes of the ANTP class, notably the Hox-linked genes, including the well-known Hox cluster genes and their chromosomal neighbors, such as Evx and Mox genes (also known as the extended Hox cluster), are very stable in animal evolution. Owing to their characteristic arrangement into linked gene clusters occupying large chromosomal regions, they have been widely used to identify WGD events and other major genomic changes in animals (Amores et al., 1998; Castro et al., 2006; Tsai et al., 2013; Smith et al., 2013). To date, fully duplicated Hox cluster genes have only been found in species that have undergone WGDs. Examination of Hox and extended Hox cluster genes in the three horseshoe crab assemblies revealed between two and four representatives of almost every gene in each species (Figure 2c, row 1). These numbers could result from a single round of WGD followed by independent duplication or two rounds of WGD followed by loss, the latter being arguably the more likely scenario, given the number of independent duplications that would be necessary across widely separated genomic regions. Our data provide a minimum estimate for these homeobox gene groups in the Xiphosura; there is a possibility that more genes may exist in the genomes than recovered in our analyses. Orthology was confirmed by examining multiple sequence alignments and phylogenetic trees made using Bayesian phylogenetic methods (MrBayes, Ronquist & Huelsenbeck, 2003), including sequences from horseshoe crabs, other chelicerates, humans, the amphioxus B. floridae and beetle Tribolium castaneum (Figure 3 and in Supplementary Files 1, 3 and 4, where nodes have not been collapsed). 
Where precise prediction of exonic sequence was impaired due to truncated contig size or divergent gene sequence, genes were excluded from phylogenetic consideration, and are shown in italics in Figure 2c. The contig sequences of all genes are listed in Supplementary File 1. Owing to the low level of nucleotide sequence divergence between the three horseshoe crab species, orthologous relationships between paralogues in the three horseshoe crabs are clear in sequence alignments, such as those of all investigated Hox genes (Supplementary File 5). This contrasts with the local tandem duplications seen in the ftz and AbdA genes of the Hox cluster in mite (Grbić et al., 2011) and the zen/Hox3 genes of Drosophila and T. castaneum (Falciani et al., 1996; Stauber et al., 1999). Thus, as observed in vertebrate and teleost genomes (Hoegg and Meyer, 2005), all three xiphosuran genomes investigated here also contain multiple copies of all Hox and extended Hox cluster gene paralogous groups, strongly suggesting WGDs in this lineage.
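The copy-number reasoning behind the one-round and two-round scenarios can be made concrete with a small sketch. It assumes an idealised WGD with no gene loss and no tandem duplication, which real genomes will not satisfy, and the function names are illustrative only.

```python
def copies_after_wgd(rounds: int) -> int:
    """Copies of a single-copy gene after n rounds of WGD with no loss."""
    if rounds < 0:
        raise ValueError("rounds must be non-negative")
    return 2 ** rounds


def min_rounds_for(copies: int) -> int:
    """Fewest WGD rounds that can yield at least this many copies,
    assuming WGD is the only source of duplicates."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return (copies - 1).bit_length()  # equals ceil(log2(copies))


for observed in (2, 3, 4):
    print(f"{observed} copies -> at least {min_rounds_for(observed)} round(s) of WGD")
```

Under these assumptions, gene families with three or four copies need two rounds of WGD, or one round followed by additional independent duplication, which is the alternative weighed in the text.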

Figure 3

ANTP HoxL class gene inter-relationships. Bayesian phylogenetic tree of ANTP HoxL class genes from sequenced xiphosuran species, along with the previously described complements of other chelicerates, amphioxus Branchiostoma floridae, human Homo sapiens and beetle Tribolium castaneum. Phylogenies inferred with the Jones model and 55 informative sites. Numbers at base of nodes represent posterior probability to two significant figures. Trees shown are the result of Bayesian analysis, run for 10 000 000 generations, and analyses run until the s.d. of split frequencies was below 0.01, with the first 25% of sampled trees discarded as ‘burn-in’. All sequences and alignments detailed in Supplementary File 1. Shading shows major gene families. Coloured dots aid in identification of species, with green circles for T. tridentatus, black diamonds for L. polyphemus and red asterisks for C. rotundicauda. Scale bar represents substitutions per site at given distances. Rooted using Pitx and Six 3/6 sequence outgroups.

Multiple copies of other homeobox and homeobox cluster genes in all three horseshoe crab genomes

Evidence for an ancestral WGD in the Xiphosura does not only come from the Hox and extended Hox cluster genes. Multiple gene copies were also found in the three sequenced horseshoe crab genomes for other dispersed groups of homeobox genes that ordinarily exist as single copy in invertebrate genomes and are dispersed over different chromosomal regions in all animals surveyed to date (Supplementary Files 3 and 4, Zhong and Holland, 2011a; Hui et al., 2012). This finding makes it highly unlikely that the Hox gene duplications are the result of local segmental duplication.

For example, multiple paralogs shared between horseshoe crab species are also found in the SINE class tree (Figure 2c, Figure 4). In metazoans, there are three families of SINE class gene: Six 1/2, Six 3/6 and Six 4/5, and most invertebrates possess a single gene copy in each Six family (Zhong and Holland, 2011a; Holland, 2012). In the genomes of our three sequenced horseshoe crab species, multiple copies of members of each of the three SINE gene families are found. Each family has clear posterior probability/bootstrap support for its internal relationships with relatively short branches (for example, 1.0/98 for monophyly of Six 3/6; 0.99/63 monophyly of Six 4/5; 0.99/63 monophyly of Six 1/2). Clear orthologous relationships between some xiphosuran SINE paralogues can be observed (for example, the subfamily A of the Six 1/2 clade has a single representative in every horseshoe crab examined; Figure 4, with single-letter suffix).

Figure 4

Phylogenetic tree of SINE class homeobox genes. Phylogenetic tree of SINE class homeobox genes from three xiphosuran species, along with the amphioxus Branchiostoma floridae, human Homo sapiens and beetle Tribolium castaneum. Bayesian/ML phylogenies inferred using the Jones and LG models, respectively, on the basis of 51 informative sites. Numbers at base of nodes represent posterior probability/bootstrap proportions (1000 replicates, expressed as %age) to two significant figures. Trees shown are the result of Bayesian analysis, run for 1 500 000 generations, until the s.d. of split frequencies was below 0.01, with the first 25% of sampled trees discarded as ‘burn-in’. Where ML analysis collapses to a polytomy at the previous node and does not match topology of Bayesian tree, bootstrap proportions are replaced with a ‘*’. All sequences and alignments detailed in Supplementary File 1. Blue-shaded areas indicate SINE gene families. Scale bar represents substitutions per site at given distances. Rooted at midpoint.

Multiple gene copies in the three sequenced horseshoe crab genomes were also found for representatives from other homeobox clusters found in other bilaterians, such as the ParaHox, Hbn/Otp/Rx and NK clusters (additional file 1, Zhong and Holland, 2011a; Hui et al., 2008, 2012; Mazza et al., 2010). Although much less studied than the Hox, ParaHox and NK clusters, the trio of hbn, rx and otp constitutes a well-conserved PRD-homeobox gene cluster whose organization has been maintained in many protostome genomes (Mazza et al., 2010). Together, these results reveal that xiphosuran genomes contain paralogous arrays of genes that are well known for being single copy in other animal species. A recent study analyzing the hormonal pathway genes in arthropods has also shown that the three species of horseshoe crabs contain multiple copies of some of the analyzed genes (Qu et al., 2015). Even if one assumes a particularly high rate of tandem duplication in the horseshoe crab lineage, the presence of multiple copies of genes that are typically arranged into large and complex syntenic and regulatory genomic regions (and thus less prone to tandem duplication) suggests a WGD scenario is more likely than independent duplication of all loci. Although gene order evidence would provide stronger support for WGD rather than independent duplication in all species in this lineage, Nossa et al. (2014) provide a strong basis for this claim. Further syntenic data from Asiatic horseshoe crab species would provide a conclusive test for WGD ancestrally in this clade.

It is necessary to ask whether the deduced WGD is a shared ancestral event, or whether WGD occurred independently in three horseshoe crab species. This cannot be easily distinguished using gene numbers. Considering the gene trees as shown above, the presence of orthologous relationships between multiple paralogues of homeobox gene families in the xiphosuran lineage strongly implies that the WGD events that are seen in these three species are shared ancestrally, and not the product of several independent WGD events in the species sequenced here. Although the Chelicerata phylogeny is controversial (Figure 1, Regier et al., 2010; Rota-Stabelli et al., 2013; Sharma et al., 2014b), the diversification of the crown group Xiphosura is well established (Figure 2b). Therefore, the ancestral WGD that we found here would date back to at least 135 million years ago.

WGD may therefore have occurred multiple times in the Xiphosura, but assessing the extent of this is not trivial. Using gene numbers to assess gene duplication is sometimes complicated by confusion between allelic variants and paralogous (duplicate) loci. In the present case, the fact that each species has multiple variants of a gene, and the demonstration that these group into orthologous groups when compared between species (Supplementary File 5), strongly supports the conclusion that these variants are duplicate genes dating to before divergence of currently extant species, not alleles differing within the sequenced individuals. As genome sequences were determined from individual animals, cases with greater than two individual sequences cannot be due to allelic variation alone. Given the presence of four or more paralogues for many genes in all three horseshoe crab species presented here, the occurrence of at least one round of WGD in their common ancestors is more parsimonious than independent duplication of all loci in question.
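The argument that more than two distinct sequences in one animal cannot be allelic variation alone follows from simple counting, sketched below. The sketch assumes the sequenced individuals are diploid, so that one locus contributes at most two alleles; the function name is illustrative only.

```python
import math


def min_loci(distinct_sequences: int, ploidy: int = 2) -> int:
    """Fewest loci that can explain the distinct sequences found in one
    individual, given that each locus carries at most `ploidy` alleles.
    Diploidy (ploidy = 2) is an assumption, not a result from the study."""
    if distinct_sequences < 0 or ploidy < 1:
        raise ValueError("invalid counts")
    return math.ceil(distinct_sequences / ploidy)


for n in (2, 3, 4):
    print(f"{n} distinct sequences -> at least {min_loci(n)} locus/loci")
```

Two sequences could be alleles of one locus, but three or four require at least two loci. Counting alone sets only this lower bound; it is the orthology of each variant across all three species that indicates the variants are separate duplicated loci.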

Subfunctionalisation of Hox gene expression

Using both RNAseq of the chelicerae and RT-PCR on the appendages along the anterior–posterior axis, we have also investigated the impact that WGD had on the expression of Hox genes in adult C. rotundicauda and T. tridentatus, whose lineages diverged approximately 45 million years ago (Figure 2b). We excluded chilaria (rudimentary appendages posterior to last walking legs) from our analysis due to their small size.

Only seven (C. rotundicauda) and four (T. tridentatus) Hox genes were expressed in adult chelicerae tissue (Figure 5). The expression of these genes in the chelicerae of these species was tested by RT-PCR in independent adult samples. In general, T. tridentatus expresses fewer of these genes than C. rotundicauda, such as the lab and Dfd duplicates. In the case of the Dfd family, no orthologues are expressed in the former species. To further test whether duplicated Hox genes have diverged in function posterior to the chelicerae, expression patterns of the Hox genes Dfd and Scr were examined on appendages along the anterior–posterior axis of adult C. rotundicauda and T. tridentatus by RT-PCR. These two genes were chosen based on their key roles in patterning and maintaining the identity of limbs in arthropods (Telford and Thomas, 1998). In Figure 6a, a variety of contrasting patterns of orthologous genes are shown, reflecting flexibility in remodeling transcriptional regulation, possibly facilitated by WGD (evidence of orthology in Supplementary File 4). For example, expression of Dfd D is consistent between the two species, whereas Dfd A, Dfd B and Dfd C are expressed in different domains in the two species (Dfd A, Dfd B and Dfd C are confined to predominantly anterior, walking limb domains in T. tridentatus but are found expressed in the appendages along much of the anterior–posterior axis of C. rotundicauda). The expression patterns of Hox genes shown above confirmed that following WGD events in horseshoe crabs, paralogous genes have undergone subfunctionalisation in these lineages.

Figure 5

RNA expression evidence for subfunctionalisation of ANTP HoxL class genes. ANTP HoxL class genes found in chelicerae RNAseq data in two species of horseshoe crab, along with RT-PCR confirmation of expression in two independent adult samples. No RT controls, also shown, confirm that expression derives from RNA and not gDNA contamination.

Figure 6

Sub- and pseudofunctionalisation of ANTP HoxL class genes. (a) Subfunctionalisation of ANTP class homeobox genes: figure showing the expression of a number of ANTP class homeodomain genes along the AP axis of adults of two species of horseshoe crabs, Carcinoscorpius rotundicauda and Tachypleus tridentatus, as determined using RT-PCR as described in methods. Significant subfunctionalisation is seen in the expression of these paralogous genes. (b) Nucleotide (left) and amino-acid (right) alignments of portion of T. tridentatus Ftz genes and transcribed Ftz pseudogene. Note indel, leading to frameshift error in translation. Amino-acid sequences of Emx and Msx genes and pseudogenes identified in genomic sequence. Asterisks indicate where single-nucleotide substitutions have resulted in nonsense mutations, coding for a premature stop codon. Red arrows indicate location of such signals.

Pseudogenization

A number of homeobox pseudogenes, with clear nonsense and indel mutations in the homeodomain, were also identified. The sequences of these homeobox pseudogenes have also been independently confirmed in individuals separate from those originally sequenced (data not shown). Pseudogenisation of these genes most likely occurred at different times in the three species. For example, Msx and Emx genes in T. tridentatus and C. rotundicauda share signatures of pseudogenization (Figure 6b), but the other pseudogenes are not detected in common between these two species. Transcription of a pseudogenised representative of the Hox gene ftz in T. tridentatus can be observed in the RT-PCR-based assays of several independent individuals (Figure 6b, Supplementary File 1). Whether other xiphosuran Hox pseudogenes are transcribed is unknown, but this example suggests the complete ‘silencing’ of pseudogenes may take millions of years after a WGD event, allowing sufficient time for extensive remodelling of regulatory mechanisms (Shakhnovich & Koonin, 2006). Birth and death rates for pseudogenes vary from species to species (Podlaha and Zhang, 2010) and the rates can be heightened by high neutral mutation rate, as observed in mouse (Waterston et al., 2002), or by high genomic deletion rates as seen in the fly (Petrov et al., 1996, 1998).

WGDs in horseshoe crabs, chelicerates and vertebrates

The Xiphosura is generally accepted as the sister group of the Arachnida (Figure 1, Regier et al., 2010, Rota-Stabelli et al., 2013). As shown in Figure 3, Figure 4 and Supplementary Files 3 and 4, the genome of the scorpion Mesobuthus martensii (Cao et al., 2013) possesses multiple copies of many of the homeobox gene families studied here. Furthermore, the spider C. salei and the scorpion C. sculpturatus also possess multiple copies of several Hox cluster genes (Schwager et al., 2007; Sharma et al., 2014a). The presence of WGD in the horseshoe crabs led us to consider whether WGD might have occurred in the chelicerate stem lineage, rather than being limited to the Xiphosura. Our phylogenetic analyses (Figures 3 and 4, Supplementary Files 3 and 4), however, do not find unambiguous orthologous relationships between duplicated homeobox genes in scorpion and spider lineages, or between these animals and horseshoe crabs. With the present data, we cannot draw firm conclusions about the number of WGD events, but suggest that independent WGD events may have occurred in different chelicerate lineages.

In the case of M. martensii, the presence of duplicates of a wide range of homeobox gene families suggests that WGD may also have occurred in the scorpion lineage. In some cases, very weak support (with posterior probability >0.5) is found for orthologous relationships between scorpion M. martensii and xiphosuran paralogues (for example, Supplementary File 3 Abd B clade). This could be the result of shared WGD, independent WGD or convergence, or more local gene duplication prior to the divergence of these lineages. At present, given contig lengths in our sequence and publicly released scaffold lengths available from the M. martensii genome project, it is not possible to resolve by syntenic analysis whether duplication is shared. With present data, we hypothesize that either independent WGD events account for the patterns seen here, or (less likely given present evidence) an ancient chelicerate WGD is shared by scorpions and horseshoe crabs. Previous studies have noted morphological links between xiphosurans and scorpions (Wheeler & Hayashi, 1998) although not normally to the exclusion of other chelicerates; hence, any inference of shared WGD between scorpions and xiphosurans would have profound phylogenetic implications. Our reasons for favouring the first scenario (independent WGD) include the general lack of support found in our trees for shared WGD, and the fact that the spider mite T. urticae, which is generally thought more closely related to scorpions than xiphosurans, lacks any clear sign of WGD (it has a single Hox cluster in its genome, and although there are some highly expanded homeobox genes, these are limited to just a few families (Grbić et al., 2011)).

Spider C. salei Hox paralogues also mirror the pattern of in-paralogue clustering (see Supplementary File 3, C. salei Ubx, Scr, Dfd), suggesting further that duplication of homeobox genes occurred in multiple chelicerate lineages independently. Whether the duplicated spider Hox genes are indicative of a WGD in that lineage is an open question. Whole-genome sequencing in C. salei, or further analyses of the two recently published spider genomes (Sanggaard et al., 2014) specifically looking for traces of WGD, would help to resolve this question. Overall, these comparative data indicate that WGDs, particularly in chelicerates, are more common phenomena than previously appreciated.

WGD events have been suggested to be a potent source of innovation in genetic regulatory networks, a driver of genomic evolution, and a contributing factor in animal diversification. However, WGD events have a patchy phylogenetic distribution in the Metazoa (Figure 1), and to date the evolutionary implications have only been inferred from vertebrates. In addition to this well-known case, WGD has recently been detected in the bdelloid rotifer Adineta vaga (Flot et al., 2013). However, the obligately asexual reproduction of this species could limit any conclusions drawn about genome evolution when using it as a comparison point with the vertebrates. The discovery of WGD in sexually reproducing horseshoe crabs will thus be useful for wider comparison of patterns of functional gain and loss with other animals, and in particular, with vertebrate WGD.

The presence of multiple rounds of WGD in the Xiphosura mirrors that seen in both vertebrates and rotifers. We suggest, therefore, that WGD may be prone to occur in succession in animal lineages that survive genome duplication. Whether this is the result of how WGDs occur in the first instance, or of how likely they are to be evolutionarily successful, is a question for further consideration.

The contrast between how the Xiphosura (which has been phenotypically consistent since the Cambrian; Rota-Stabelli et al., 2013) and the Vertebrata (which exhibit extreme phenotypic diversity) have utilised WGD events is stark. Even during the Carboniferous period, when the xiphosuran group experienced its highest diversity peak, the lineage consisted of no more than 50 species, as evidenced by the fossil record, indicating a very low radiation potential (Anderson and Selden, 1997), although it is possible that the recovered fossil record is an imperfect reflection of actual diversity. Nevertheless, our data are consistent with the idea that WGD events need not result in morphological changes (at least in horseshoe crabs), and could allow the buffering of transcriptional networks to external insult through complex overlapping patterns of expression, as suggested by Chapman et al. (2006). This idea is further reinforced by the fact that transcriptional regulators are disproportionately retained after WGD events (Maere et al., 2005; Blomme et al., 2006). It is important to note that a range of other factors, particularly increased network-interaction complexity, may have contributed to the diversification in the Vertebrata, and this issue will be an interesting topic for future study in cases of invertebrate WGD.

The relative morphological stasis of horseshoe crabs implies that novelty at a genomic level has not been mirrored phenotypically in this clade. Does this suggest a general difference between invertebrates and vertebrates in the evolutionary significance of WGD? Before drawing such a conclusion, it is important to note that the extent to which vertebrate WGD affected the later evolution of this clade has been a subject of debate (Furlong and Holland, 2002; Dehal and Boore, 2005; Donoghue and Purnell, 2005). Additional data will greatly aid in discerning what influence WGD has on subsequent genomic and phenotypic evolution. Further examination of patterns of gene gain/loss, expression and functional divergence of paralogues, and regulatory network changes (such as post-transcriptional gene regulation by microRNAs) following WGD in the Xiphosura and in vertebrates is warranted to disentangle this complex issue.

This work provides clear evidence of an ancient WGD event in the horseshoe crab lineage through the genomic sequencing of three of the four extant species. This is the first WGD event to be described in the ecdysozoans, and is one of very few such events known from the animal kingdom. The data presented here urge reexamination of long-standing hypotheses regarding the evolutionary outcomes of WGDs in animals. The three xiphosuran cases identified here will provide evolutionary comparison points for the inference of the effect of polyploidy on animals.

Data archiving

All data related to this work are available for download from the NCBI Short Read Archive, our own website (tinyurl.com/HSCgenomes) and the Dryad repository linked to this article (http://dx.doi.org/10.5061/dryad.81fv1).