Horseshoe crab genomes reveal the evolutionary fates of genes and microRNAs after three rounds (3R) of whole genome duplication

Whole genome duplication (WGD) has occurred in relatively few sexually reproducing invertebrates. Consequently, the WGD that occurred in the common ancestor of horseshoe crabs ~135 million years ago provides a rare opportunity to decipher the evolutionary consequences of a duplicated invertebrate genome. Here, we present a high-quality genome assembly for the mangrove horseshoe crab Carcinoscorpius rotundicauda (1.7Gb, N50 = 90.2Mb, with 89.8% sequences anchored to 16 pseudomolecules, 2n = 32), and a resequenced genome of the tri-spine horseshoe crab Tachypleus tridentatus (1.7Gb, N50 = 109.7Mb). Analyses of gene families, microRNAs, and synteny show that horseshoe crabs have undergone three rounds (3R) of WGD, and that these WGD events are shared with spiders. Comparison of the genomes of C. rotundicauda and T. tridentatus populations from several geographic locations further elucidates the diverse fates of both coding and noncoding genes. Together, the present study represents a cornerstone for a better understanding of the consequences of invertebrate WGD events on evolutionary fates of genes and microRNAs at individual and population levels, and highlights the genetic diversity with practical values for breeding programs and conservation of horseshoe crabs.

). 14 Horseshoe crabs are considered to be 'living fossils', with the oldest fossils dated 15 from the Ordovician period (~450 million years ago (Mya), Rudkin and Young 2009). 16 However, despite this long history, there are only four extant species of horseshoe crabs 17 worldwide: the Atlantic horseshoe crab (Limulus polyphemus) from the Atlantic East Coast of 18 North America, and the mangrove horseshoe crab (Carcinoscorpius rotundicauda), the Indo- 19 Pacific horseshoe crab (Tachypleus gigas), and the tri-spine horseshoe crab (Tachypleus 20 tridentatus), from South and East Asia (John et al 2018). All extant horseshoe crabs are 21 estimated to have diverged from a common ancestor that existed ~135 Mya (Obst et al 2012), 22 and they share an ancestral WGD (Kenny et al 2016). A high-quality genome assembly was 23 recently announced as a genomic resource for T. tridentatus (Gong et 27 In the present study, we provide the first high quality genome of the mangrove 28 horseshoe crab (C. rotundicauda), and a resequenced genome of tri-spine horseshoe crab (T. 29 tridentatus). Importantly, we present evidence for the number of rounds of WGD that have 30 occurred in these genomes, and investigate if these represent a shared event with spiders. We 31 also examine the evolutionary fate of genes and microRNAs at both the individual and 32 4 population level in these genomes. Collectively, this study highlights the evolutionary 1 consequences of a unique invertebrate WGD, while also providing detailed genetic insights 2 which will also be useful for various genomic, biomedical, and conservation measures. 3 4 Results and Discussion 5 High-quality genomes of two horseshoe crabs 6 Genomic DNA was extracted from single individuals of two species of horseshoe crab, 7 C. rotundicauda and T. tridentatus ( Figure 1B), and sequenced using Illumina short-read, 8 10X Genomics linked-read, and PacBio long-read sequencing platforms (Supplementary 9 information S1, Table 1.1.1-1.1.2). Hi-C libraries were also constructed for both species 10 sequenced using the Illumina platform (Supplementary information S1, Figure S1.1.1-1.1.2). 11 For the final genome assemblies, both genomes were first assembled using short-reads, 12 followed by scaffolding with Hi-C data. The C. rotundicauda genome assembly is 1.72 Gb 13 with a scaffold N50 of 90.2 Mb ( Figure 1C). The high physical contiguity of the genome is 14 matched by high completeness, with 93.8% complete BUSCO core eukaryotic genes ( Figure   15 1C). The T. tridentatus genome is 1.72 Gb with a scaffold N50 of 109.7 Mb and 93.7 % 16 BUSCO completeness ( Figure 1C). In total, the C. rotundicauda and T. tridentatus genome 17 assemblies include 34,354 and 42,906 gene models, respectively. Furthermore, 89.8% of the 18 sequences assembled for C. rotundicauda genome are contained on just 16 pseudomolecules, 19 consistent with a near chromosome-level assembly (chromosome 2n=32, Iswasaki et al 1988, 20 Supplementary information S1, Table 1.1.3). 21 To date, the only repeat data available for horseshoe crabs are two independent 22 analyses of the tri-spine horseshoe crab T. tridentatus, which identified a repeat content of In the present study, we provide the 24 first analysis of repeat content in the genomes of different horseshoe crab species, by 25 analysing repeats in our genome assembly for T. tridentatus, as well as our assembly for the 26 mangrove horseshoe crab, C. rotunicauda. We find that repeat content is similar in both 27 genomes, occupying approximately one third of total genomic content. Specifically, we 28 identify a total repeat content of 32.99% for T. tridentatus and 35.01% for C. rotunicauda, of 29 which the dominant repeats are DNA elements, followed by LINEs, with SINEs and LTR 30 5 elements contributing just a small proportion of total repeat content ( Figure 1D, 1 Supplementary information S1, Table 1.2.1). 2 A large proportion of eukaryotic genomes is typically composed of repetitive DNA, 3 and repeats are widely cited as being one of the key determinants of genome size (Chénais et 4 al 2012). However, while the genome size for both species of horseshoe crab sequenced here 5 is comparatively large for invertebrates, their repeat content is not unusually high (C. 6 rotundicauda: 35.02%, T. tridentatus: 32.98%, Figure 1D, Supplementary information S1, 7 Table 1.2.1). Instead, the comparatively large size of horseshoe crab genomes appears to be a 8 consequence of multiple rounds of WGD, as discussed in greater detail below. 9 In the C. rotundicauda genome, repeats are evenly distributed across genic and 10 intergenic regions ( Figure 1D). However, in the T. tridentatus genome, a greater proportion 11 of repeats are found in genic regions, due primarily to a higher density of DNA elements and 12 LINEs, as well as unclassified elements ( Figure 1D). Repeat landscape plots ( Figure 1D) 13 suggest a relatively similar pattern of historical transposable element activity for both 14 horseshoe crab species. Recent activity appears to have tapered off more quickly in the T. 15 tridentatus genome, particularly with respect to LTR elements and certain DNA elements 16 ( Figure 1D). 17 Three rounds (3R) of whole genome duplications in horseshoe crabs 18 Initial efforts to analyse WGD in extant horseshoe crabs were from low-depth and 19 genotyping-by-sequencing which hindered the understanding of WGD in these taxa (Nossa et 22 genome assembly has the largest contig N50 ( Figure 1C). Furthermore, our assembly for C. 23 rotundicauda represents the first close to chromosomal-level genome assembly for this 24 species. Consequently, the two high-quality horseshoe crab genomes presented in this study 25 provide us with an unprecedented opportunity to address the issue of invertebrate WGD and 26 its evolutionary consequences. 27 An important outstanding question is how many rounds of WGD occurred in the last  question, we first investigated the number and genomic location of Hox cluster genes, which 1 have played the role of a "Rosetta stone" for understanding animal evolution (Holland 2017 Hox genes, providing evidence that two rounds of WGD occurred between the most recent   contrary, our results suggested that there are three Hox clusters (including Ftz), and thus more 16 than one round of WGD occurred in the lineage leading to extant horseshoe crabs. 17 We then investigated the sister cluster of the Hox genes -the ParaHox cluster genes, 18 which are also highly clustered in bilaterians (Brooke et al 1998;Hui et al 2009;. 19 Similar to the Hox cluster genes, the invertebrate amphioxus contains only a single ParaHox 20 gene cluster in its genome, while the ParaHox cluster genes are located on four chromosomes 21 in human (Putnam et al 2008). In comparison, both the horseshoe crab genomes for C. 22 rotundicauda and T. tridentatus contain two ParaHox clusters, composed of Gsx and Cdx, 23 with other ParaHox genes located on three scaffolds. Meanwhile, in the genome assembly of  Using genome-wide analyses of homeobox gene content in three horseshoe crab 1 genomes, we find that many homeobox genes are present in more than 4 copies ( Figure 2D, 2 details are shown in Supplementary information S1, Table 1 this question, we further carried out genome-wide synteny analyses to shed further light on 6 the situation. As shown in Figure 3A, using a default of a minimum of 7 genes to define a 7 syntenic block, most of the chromosomes of C. rotundicauda exhibit synteny with on other 8 chromosomes, with most of them have a number between 4-8 (including its own copy). Thus, 9 we propose that a 3R WGD occurred in the horseshoe crab. 10 11 Shared or independent duplications with spider? 12 Another major unresolved question relating to horseshoe crab genomes is whether the 13 reported cases of WGD in chelicerates constitute shared or independent events. Gene family 14 analyses of spider and scorpion genomes have suggested that an ancient WGD is shared 15 between them, independent of the WGDs that occurred in horseshoe crabs (Schwager et al 16 2017). Using the two horseshoe crab genome assemblies generated here, we tackled this 17 important question from two different perspectives: (i) we performed analyses of synteny as 18 a more rigorous examination of the question, and, (ii) we reconsidered recent evidence on 19 phylogenetic relationships within the Chelicerata. 20 We first carried out the syntenic analyses between the Hox scaffolds of C. 3B). Despite no clear shared duplication event between C. rotundicauda and spider Hox, 23 surprisingly, we observed syntenic relationships between two Hox scaffolds when using a 24 minimum of 5 genes to define a syntenic block ( Figure 3B). Similarly, in the syntenic 25 comparison of Hox scaffolds of T. tridentatus and the published spider and scorpion genomes, 26 we could observe syntenic relationships between two different Hox scaffolds when using a 27 minimum of 5 genes to define a syntenic block ( Figure 3B). In a less stringent condition of 28 using a minimum of 2 genes to define a syntenic block, we additionally observed syntenic 29 relationships between two other Hox scaffolds between T. tridentatus and spider ( Figure 3B). Our data, suggested for the first time, that the WGD in horseshoe crab is a shared event with 1 the WGD in spider and scorpion. 2 An important consideration necessary to fully understand WGD events identified 3 from horseshoe crab genomes are the phylogenetic relationships between these animals. 4 Horseshoe crabs have long been regarded as a monophyletic group (Xiphosura) and the sister 5 group to the terrestrial chelicerate clade that includes spiders and scorpions (Arachinida). 6 However, in a recent phylogenetic analysis using publicly available data, including three 7 xiphosurans, two pycnogonids, and 34 arachnids, it has been suggested that the horseshoe   Table   22 1.1.5-1.1.6), we analysed the evolutionary consequences of small noncoding RNAs after the 23 WGD events in both C. rotundicauda and T. tridentatus. To reveal if duplicated microRNAs 24 can also provide insights into the number of rounds of WGD, we first examined the number 25 of paralogues for the bilaterian conserved set of 57 microRNAs, across three horseshoe crab 26 genomes ( Figure 4A). Of these microRNAs, 27 and 33 have more than 4 copies in T. 27 tridentatus and C. rotundicauda respectively ( Figure 4A). These data further support the 28 hypothesis that 3R WGD occurred in the horseshoe crabs. To understand the fates of microRNA paralogues, we first analysed the sequence 1 conservation/divergence of 41 conserved microRNA families and 4 chelicerate-specific 2 microRNAs by aligning their sequences (Supplementary information S1, Figure S1.2.10). We 3 found that the paralogues always have more sequence conservation in one arm (rather than 4 showing similar conservation for both arms across paralogues) after WGD (Supplementary 5 information S1, Figure S1.2.10). An example is illustrated for the microRNA bantam, where 6 the sequence of the 5p arm is less conserved than the 3p arm between paralogues ( Figure   7 4Ba). 8 To explore whether the more conserved microRNA arm correlates with expression 9 level, we mapped small RNA reads to different paralogues. By eliminating microRNA  For example, the 3p arm shows more sequence conservation between the bantam paralogues 14 in horseshoe crabs, and their 3p arms also show higher expression levels than their 5p arms 15 ( Figure 4Bb, Supplementary information S3). The 26 conserved microRNAs identified as 16 showing higher expression levels for the conserved arm serve as the first example correlating 17 expression level and conservation of mature microRNA sequences in paralogues following 18 WGD. 19 In addition to relatively old conserved microRNAs, we also investigated new/novel 20 microRNAs which are specific to a certain horseshoe crab species, to understand the impact 21 of WGD on these. A total of 12 novel microRNAs were identified and conserved between C. 22 rotundicauda and T. tridentatus (Supplementary information S1, Figure S1.2.10). The 23 identified novel microRNAs are highly conserved in sequences between orthologues than 24 paralogues, an example is shown in Figure 4Bc, suggesting these horseshoe crab-specific 25 novel microRNAs are born at the horseshoe crab ancestor after WGD.

26
In the common house spider Parasteatoda tepidariorum which is believed to have  In summary, the first investigation of microRNAs in horseshoe crabs provide another 10 dimension for understanding the fates of duplicated noncoding microRNAs in invertebrates. 11 12 WGD at population level 13 Another question that remains poorly explored is the evolutionary consequences of 14 WGD on gene duplicates at the population level. Individuals of both C. rotundicauda and T. 15 tridentatus were collected from different locations across Asia and subjected to genome 16 sequencing ( Figure 5A, Supplementary information S1, Table 1 Asia, which may be due to the strong ocean currents that had prevented the gene flows 22 between these locations ( Figure 5B; Supplementary information S1, Figure S1.2.6-1.2.7). 23 Taking advantage of these population genomic data, we further asked the question of  In T. tridentatus, non-synonymous substitutions at the homeodomain of Six3/6-like 1 and Onecut-E genes were revealed in certain individuals from Malaysia populations ( Figure   2 5Ca-b). Similarly for C. roundiculata, non-synonymous substitutions at the homeodomain of 3 the En-D gene were also revealed in some individuals from populations in Thailand ( Figure   4 5Cc). This is the first evidence showing that different gene duplicates after WGD in 5 invertebrates are under different rates of mutation and selection at the individual level. 6 Importantly, unique pseudogenisation was discovered in the paralogue of Unpg in 7 many individuals in the C. rotundicauda population located in Hong Kong ( Figure 5D). In 9 8 out of the 10 individuals captured in Hong Kong for sequencing, we found that there is an 9 alternative form (ALT), with a deletion in Unpg-A1 ( Figure 5D). Given that homeodomains 10 are standardised as transcription factors with a sequence length of ~60-63 amino acids 11 (Holland 2013), the deletion suggests that in these individuals these genes are in the process 12 of becoming pseudogenes. This is the first evidence demonstrating the ongoing and dynamic 13 mutation rate of paralogues at population level after WGD in invertebrates. 14

16
WGD remains an understudied area, particularly in invertebrates such as the horseshoe crabs, 17 despite its considerable importance in animal evolution. This study provides evidence of the 18 3R WGD events in horseshoe crabs, and sheds light on the evolutionary fates of genes and 19 microRNAs at both the individual and population levels, as well as highlighting the genetic 20 diversity of these amazing animals, with importance for understanding their evolution, 21 genomics, and practical value for breeding programs and conservation. 22 23 Materials and methods 24 DNA, mRNA and sRNA extraction and sequencing 25 Genomic DNA of the horseshoe crabs C. rotundicauda and T. tridentatus was isolated 26 from the leg muscle of a single individual in each case, using the PureLink Genomic DNA 27 Kit (Invitrogen). In addition, different tissues were dissected and homogenized in Trizol 28 reagent (Invitrogen), and total RNA was isolated following the manufacturers' instructions.

29
Blood samples of both species of horseshoe crab were drawn by syringe and directly 1 transferred into Trizol reagent for RNA extraction. For egg, 1 st , 2 nd and 3 rd instars of T. 2 tridentatus, whole individuals were used for RNA extraction. Extracted gDNA was subject to 3 quality control using gel electrophoresis. Qualified samples were sent to Novogene and 4 Dovetail Genomics for library preparation and sequencing. In addition, a Chicago library was 5 prepared by Dovetail Genomics using the method described by Putnam et al (2016). Briefly, 6 ~500ng of high molecular weight gDNA (mean fragment length = 55 kb) was reconstituted 7 into artifical chromatin in vitro and fixed with formaldehyde. Fixed chromatin was digested 8 with DpnII, the 5' overhangs filled in with biotinylated nucleotides, and free blunt ends were 9 ligated. After ligation, crosslinks were reversed, and the DNA purified. Purified DNA was 10 treated to remove biotin that was not internal to ligated fragments. The DNA was then 11 sheared to ~350 bp mean fragment size and sequencing libraries were generated using 12 NEBNext Ultra enzymes and Illumina-compatible adapters. Biotin-containing fragments 13 were isolated using streptavidin beads before PCR enrichment of each library. The libraries 14 were sequenced on the Illumina HiSeq X platform. Dovetail HiC libraries were prepared as DpnII, the 5' overhangs filled in with biotinylated nucleotides, and free blunt ends were 18 ligated. After ligation, crosslinks were reversed and the DNA purified. Purified DNA was 19 treated to remove biotin that was not internal to ligated fragments. The DNA was then 20 sheared to ~350 bp mean fragment size and sequencing libraries were generated using 21 NEBNext Ultra enzymes and Illumina-compatible adapters. Biotin-containing fragments 22 were isolated using streptavidin beads before PCR enrichment of each library. Details of the 23 sequencing data can be found in Supplementary information S1,    Genome, mRNA transcriptome, and sRNA assembly and annotation 5 To process the Illumina sequencing data, adapters were trimmed and reads were 6 filtered using the following parameters "-n 0.1 (i.e. removal if N accounted for 10% or more  16 Chromium WGS reads were separately used to make a de novo assembly using Supernova (v 17 2.1.1), with the parameter "--maxreads=231545066" for C. rotundicauda, and "--18 maxreads=100000000" for T. tridentatus, respectively. The de novo assembly, shotgun reads, 19 Chicago library reads, and Dovetail HiC library reads were used as input data for HiRise, a 20 software pipeline designed for using proximity ligation data to scaffold genome assemblies 21 (Putnam et al, 2016). An iterative analysis was conducted. First, Shotgun and Chicago library 22 sequences were aligned to the draft input assembly using a modified SNAP read mapper 23 (http://snap.cs.berkeley.edu). The separation of Chicago read pairs mapped within draft 24 scaffolds was analysed by HiRise to produce a likelihood model for genomic distance 25 between read pairs, and the model was used to identify and break putative misjoins, to score 26 prospective joins, and to make joins above a threshold. After aligning and scaffolding 27 Chicago data, Dovetail HiC library sequences were aligned and scaffolded following the 28 same method. After scaffolding, shotgun sequences were used to close gaps between contigs. To process small RNA data, we removed small RNA sequencing raw reads with 21 Phred quality score less than 20, and adaptor sequences were trimmed. Processed reads of

13
"Genic" repetitive elements were defined as those overlapping loci annotated as genes ± 2kb 14 and identified using the BedTools window function. All plots were generated using Rstudio    MicroRNA arm switching detection 15 The expression levels of 5p and 3p arms of microRNAs in the horseshoe crabs were 16 calculated based on the number of sequencing reads mapped to the respective arm region in 17 the predicted microRNA hairpin using bowtie/mirDeep2. Sciences of The Chinese University of Hong Kong (JHLH).    dominance, red-5p dominance, yellow-no preference, white-no expression. population. 15 16 Supplementary information S1. Supplementary data. 17 18 Supplementary information S2. Information of homeobox gene sequence and genomic 19 locations.

21
Supplementary information S3. MicroRNA contents and arm usage of the two horseshoe 22 crabs. 23 24 Supplementary information S4. SNPs at the homeodomains of the two horseshoe crabs.