Data Descriptor: Draft genomic and transcriptome resources for marine chelicerate Tachypleus tridentatus


Background & Summary
Extant horseshoe crabs fall into two geographically separated groups 1 . Tachypleus tridentatus (2n = 26), Tachypleus gigas (2n = 28) and Carcinoscorpius rotundicauda (2n = 32) inhabit the coastline of Southeast Asia, whereas Limulus polyphemus (2n = 52) is distributed along the Atlantic coastline of North America 2 . Horseshoe crabs have the rare feature of blue blood, which has extensive infection-fighting properties and can be used to quantify gram-negative bacterial endotoxins or fungal contamination of medical products 3 . Horseshoe crabs show some features of crustaceans (a crab-like shell and claws), but they are not crabs at all and are more closely related to spiders, scorpions and trilobites 4,5 . They belong to their own class, Merostomata, whose members have a body divided into three parts: the prosoma, opisthosoma and telson.
T. tridentatus, the tri-spine horseshoe crab and the largest of the living horseshoe crab species, was once widespread along the coasts of Fujian, Hainan and the Beibu Gulf 2 . Its population has since decreased drastically for various reasons, such as environmental pollution, loss of coastline and excessive exploitation for blood extracts (amoebocyte lysate). T. tridentatus is now a 'Grade II Protected Animal of China' and is 'protected aquatic wildlife' in Fujian Province and the Guangxi Zhuang Autonomous Region. Although many horseshoe crab nature reserves have been established since 2001 in Guangdong Province around the northern South China Sea, more marine reserve networks along the South and East China seaboards are needed to protect the habitat of T. tridentatus and allow its population to recover 6 .
The external morphology of horseshoe crabs has remained nearly unchanged since the Ordovician, approximately 445 million years ago 7 . The morphological resemblance of modern horseshoe crabs to their ancient fossils raises the question of whether the horseshoe crab genome has been as slow to evolve as the animal's outward appearance implies. Several genome or transcriptome projects on horseshoe crabs have used genotyping-by-sequencing or low-depth sequencing strategies, including the draft genome of the Atlantic horseshoe crab (L. polyphemus) 8-10 , the partial genomes of three horseshoe crabs (C. rotundicauda, L. polyphemus and T. tridentatus) 11 , and the de novo transcriptome of T. tridentatus during embryonic development 12 . Nevertheless, the lack of fully sequenced and finely annotated horseshoe crab genomic information has hindered the emergence of arthropod models for slow evolution, a topic that has been widely discussed for the brachiopod Lingula anatina 13 and the sarcopterygian Latimeria chalumnae 14 .
To understand the genomic features of this living fossil and help protect this endangered species, we have generated a draft genome and transcriptome of T. tridentatus. We extracted genomic DNA from leg muscle tissue and constructed 4 paired-end (PE) and 5 mate-pair (MP) libraries, which were sequenced on the Illumina HiSeq 2500 or HiSeq 4000 platforms. Three larval stages (pre-trilobite, trilobite and post-trilobite), with two biological replicates per stage, were chosen for transcriptome sequencing. After filtering out low-quality, adaptor-contaminated and PCR-duplicate reads from the genome sequencing data, we used KMERFREQ_AR v2.0.4 15 and JELLYFISH v1.1 16 to evaluate genome size, heterozygosity and repeat content from the frequency of 17-mers, and then used Platanus v1.2.4 17 to assemble the genome from all clean PE and MP reads. An additional round of gap filling was performed on the assembly using GapCloser v1.12-r6 15 . The final assembled genome size was 1.94 Gb, containing 736,826 contigs (N50 = 52.2 kb) and 671,877 scaffolds (N50 = 2.76 Mb), and represented approximately 90.16% of the genome size estimated by k-mer analysis. BUSCO v2.1 18 and the alignment of de novo assembled transcripts demonstrated that the genome was of considerable completeness with high coverage of gene regions. Transposable elements accounted for 39.96% of the T. tridentatus genome, with DNA transposons making up the largest portion (23.06%). Structural annotation of the genome yielded 29,134 genes, and approximately 83.37% of these genes were functionally annotated by at least one of four sources (InterPro, KEGG, SwissProt and TrEMBL). The synonymous substitution (Ks) distribution of the T. tridentatus paralogues indicated that the Chinese horseshoe crab has undergone two rounds of whole-genome duplication (WGD), as has its Atlantic counterpart. The release of the T. tridentatus genome and transcriptome in this study provides a crucial resource for future efforts to devise better strategies for conserving the endangered Asian horseshoe crab and to establish it as a model species for the study of marine chelicerates.

Tissue sampling
For genomic sequencing, an adult T. tridentatus (Fig. 1) was acquired from the Guangxi Key Laboratory of Beibu Gulf Marine Biodiversity Conservation. Leg skeletal muscle tissue was collected and then stored in liquid nitrogen for immediate DNA extraction. For transcriptome sequencing, six fertilized eggs were collected by laboratory spawning and incubated following standard procedures under appropriate temperature and salinity 19 . Three larval stages were collected according to Sekiguchi's definition 20 : 'pre-trilobite', 'trilobite' and 'post-trilobite'. Two biological replicates of each larval stage were collected and stored in liquid nitrogen for immediate RNA extraction.

Library construction and sequence quality control
Genomic DNA was extracted from frozen muscle tissue of T. tridentatus using a genomic DNA isolation kit (Qiagen, Hilden, Germany) according to the manufacturer's protocol and then stored at −80°C until library preparation. Subsequently, 4 paired-end libraries with insert sizes of 270 bp, 300 bp, 500 bp and 800 bp, and 5 mate-pair libraries with insert sizes of 2 kb, 5 kb, 10 kb, 20 kb and 40 kb were generated (Data Citation 1) and sequenced by BGI (Shenzhen) on an Illumina HiSeq 2500 or HiSeq 4000, following the manufacturer's instructions. In total, we generated 554.2 Gb of raw bases from 19 lanes (Table 1). The raw reads of the PE and MP libraries were then filtered by SOAPnuke v1.5 21 according to the following criteria: 1) reads with adapter contamination or polymerase chain reaction duplicates were discarded; 2) paired-end reads with more than 30% low-quality bases (Q value ≤ 15) and mate-pair reads with more than 20% low-quality bases (Q value ≤ 10) were removed; 3) reads with more than 1% N bases were removed; and 4) read pairs in which read 1 and read 2 matched over 10 bp with at least a 10% mismatch were removed. After pre-processing with SOAPnuke v1.5, a total of 202.67 Gb of clean paired-end reads and 71.54 Gb of clean mate-pair reads were obtained (Table 1).
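The N-base and low-quality criteria above can be illustrated with a minimal Python sketch. This is not SOAPnuke's implementation: the function names and the representation of a read as a sequence string plus a list of Phred scores are assumptions made for illustration, and the adapter, duplicate and read-overlap criteria are not covered.

```python
def low_quality_fraction(quals, q_cutoff):
    """Fraction of bases whose Phred quality is <= q_cutoff."""
    if not quals:
        return 1.0  # treat an empty read as entirely low quality
    return sum(q <= q_cutoff for q in quals) / len(quals)


def n_fraction(seq):
    """Fraction of ambiguous (N) bases in a read."""
    if not seq:
        return 1.0
    return seq.upper().count("N") / len(seq)


def passes_filter(seq, quals, library="PE"):
    """Apply the N-base and low-quality criteria to a single read.

    Thresholds follow the text: paired-end (PE) reads are removed when more
    than 30% of bases have Q <= 15; mate-pair (MP) reads are removed when more
    than 20% of bases have Q <= 10. Any read with more than 1% N is removed.
    """
    q_cutoff, max_low = (15, 0.30) if library == "PE" else (10, 0.20)
    if n_fraction(seq) > 0.01:
        return False
    return low_quality_fraction(quals, q_cutoff) <= max_low
```

In practice a read pair would be discarded if either mate fails, so such a check would be applied to both reads of a pair.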
Total RNA from each larval stage was extracted separately according to the TRIzol protocol (Invitrogen). Samples were sequenced by BGI (Shenzhen) on an Illumina HiSeq 2500 platform (Data Citation 1). Paired-end reads were generated with a read length of 100 bp. A total of 436.63 Mb RNA-Seq reads were produced (Table 2). The raw reads of the transcriptome libraries were filtered by SOAPnuke v1.5 according to the following criteria: 1) reads with more than 1% N bases were removed; 2) reads with more than 20% low-quality bases (Q value ≤ 10) were removed; and 3) reads with adapter contamination were discarded. After pre-processing with SOAPnuke, a total of 393.57 Mb RNA-Seq reads remained, with an average Q20 ≥ 96% (Table 2).

Characteristics of the T. tridentatus genome
A total of 91.34 Gb of clean reads from the libraries with insert sizes of 300 bp and 500 bp were chosen for k-mer analysis using KMERFREQ_AR v2.0.4 15 and JELLYFISH v1.1 16 with a k length of 17. The genome size (G) of T. tridentatus was estimated by the following formula: G = k-mer number/k-mer depth, where the k-mer number is the total number of k-mers and the k-mer depth is the depth of the main peak in the k-mer frequency distribution. The k-mer frequency distribution indicated that T. tridentatus is a diploid species with low heterozygosity and repetition (Fig. 2). The frequency of 17-mers counted with KMERFREQ_AR v2.0.4 15 gave an estimated genome size of 2.15 Gb, whereas the JELLYFISH v1.1 16 approach yielded an estimated genome size of 2.14 Gb (Supplementary Table 1). The JELLYFISH result was subsequently passed to GenomeScope 22 , which indicated that the heterozygosity of the T. tridentatus genome was approximately 0.65%. The estimated genome size of T. tridentatus is much smaller than that of L. polyphemus, which is estimated to be 2.74 Gb (2.8 pg) based on the biochemical method 23 .

Table 2. RNA-Seq data yields of three larval stages. Note: the three larval developmental stages were collected according to Sekiguchi's definition ('pre-trilobite', 'trilobite' and 'post-trilobite'), with two biological replicates for each stage. The raw reads were filtered using SOAPnuke.
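The genome size formula G = k-mer number/k-mer depth used in this section can be sketched in Python as follows. The histogram format (a mapping from depth to the number of distinct k-mers observed at that depth) and the `min_depth` cutoff used to skip low-depth error k-mers are assumptions for illustration; they are not the settings of KMERFREQ_AR or JELLYFISH.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size from a k-mer depth histogram.

    histogram: dict mapping depth -> number of distinct k-mers seen at that depth.
    min_depth: hypothetical cutoff below which k-mers are treated as
               sequencing errors and ignored.
    Returns (estimated genome size, depth of the main peak).
    """
    usable = {d: c for d, c in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # The main peak is the depth with the largest number of distinct k-mers.
    peak_depth = max(usable, key=usable.get)
    # Total k-mer number: each distinct k-mer contributes its depth.
    total_kmers = sum(d * c for d, c in usable.items())
    return total_kmers / peak_depth, peak_depth
```

For a toy histogram such as `{1: 1000, 2: 200, 19: 50, 20: 100, 21: 50}`, the low-depth entries are ignored, the peak is at depth 20 and the estimate is 4,000/20 = 200.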

Genome assembly
For genome assembly, contig construction, scaffolding and gap closing of the T. tridentatus genome were performed with Platanus v1.2.4 17 , which automatically optimizes multiple k-mer values, using clean reads from the paired-end and mate-pair libraries. All steps were run with default parameters except for the initial k-mer size, which was set to 37. One further round of gap closing was performed using GapCloser v1.12-r6 15 to fill the remaining gaps in the scaffolds. The final assembly contained 671,877 scaffolds with a total length of 1.94 Gb, which represented approximately 90.16% of the genome size estimated by k-mer analysis. The longest scaffold and contig were 18,230,544 and 1,165,240 bp, respectively, and the 2,573 scaffolds longer than 10 kb accounted for 87.92% of the assembled sequence. The contig and scaffold N50s were 52,179 and 2,761,313 bp, respectively, and gaps made up 1.55% of the genome (Table 3).
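For readers unfamiliar with the N50 statistic reported above, the following minimal Python sketch shows the standard definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. The function name is illustrative and the sketch is independent of the assembly software used here.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence downwards until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input
```

For example, for lengths `[2, 2, 2, 3, 3, 4, 8, 8]` (total 32), the two longest sequences already cover 16 bases, so the N50 is 8.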

Transcriptome assembly
The de novo transcriptome assembly of each larval stage was performed using Trinity v2.4.0 24 with default parameters. The Trinity assembly resulted in 669,788 transcripts with an average N50 of approximately 1 kb (Supplementary Table 2). The transcriptome assemblies of the same stage were then clustered to remove redundancy and form the Unigenes using the TGI Clustering Tool (TGICL) v2.  27,28 . The genome index was built using hisat2-build, the clean transcriptome reads were then mapped to the genome using HISAT2, and the alignment results of all larval stages were merged into a single BAM file using SAMtools v1.3 29 . Finally, the genome-guided transcriptome assembly was performed using StringTie with the single BAM file.
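The genome-guided steps described above (index building, read mapping, merging and assembly) might be chained as in the Python sketch below, which only builds the command lines and does not run them. All file names are hypothetical, and the `--dta` option and the per-sample `samtools sort` step are assumptions that are not stated in the text; the exact parameters used in this study are not reproduced here.

```python
def build_commands(genome_fa, samples, index="genome_index",
                   merged_bam="merged.bam", out_gtf="guided.gtf"):
    """Build command lines for a HISAT2 -> SAMtools -> StringTie workflow.

    samples: dict mapping sample name -> (reads_1.fq, reads_2.fq).
    Returns a list of argument lists suitable for subprocess.run().
    """
    cmds = [["hisat2-build", genome_fa, index]]
    sorted_bams = []
    for name, (reads_1, reads_2) in samples.items():
        sam = f"{name}.sam"
        bam = f"{name}.sorted.bam"
        # --dta (assumed here) reports alignments tailored for transcript assemblers.
        cmds.append(["hisat2", "--dta", "-x", index,
                     "-1", reads_1, "-2", reads_2, "-S", sam])
        # Coordinate-sort each alignment before merging (assumed step).
        cmds.append(["samtools", "sort", "-o", bam, sam])
        sorted_bams.append(bam)
    cmds.append(["samtools", "merge", merged_bam, *sorted_bams])
    cmds.append(["stringtie", merged_bam, "-o", out_gtf])
    return cmds
```

Each returned list could be passed to `subprocess.run(cmd, check=True)` once the tools are installed and the input files exist.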

Repeat annotation
Prior to gene prediction, transposable elements in the genome of T. tridentatus were identified by a combination of homology-based and ab initio methods. For the homology method, RepeatMasker v4.0.5 30

Gene prediction
The MAKER 34 pipeline was used to predict gene models by integrating homology-based, transcriptome-based and ab initio gene predictions. MAKER was initially run in the est2genome = 1 and protein2genome = 1 mode, which creates gene models directly from the transcript and protein evidence. The transcript evidence was based on the de novo transcriptome assembly of the three larval stages, which included 273,085 transcripts (Supplementary Table 2), whereas the 96,006 protein sequences were collected from the genomes of four arthropods deposited at NCBI: Limulus polyphemus 10,35 , Mesobuthus martensii 36 , Stegodyphus mimosarum 37 and Ixodes scapularis 38 . The option 'rm_gff' was set to the transposable element file obtained from the repeat annotation step, and the expected maximum intron size for evidence alignments was set to 30,000 according to the maximum intron size of L. polyphemus. The initial run produced 25,252 gene models, from which 2,000 high-confidence gene models were randomly selected according to the default criteria of maker2zff (an application in the MAKER pipeline), except that the maximum annotation edit distance (AED) was set to 0.1. These high-confidence gene models were then used to train the parameters of SNAP 39 (release-2013-11-29) and AUGUSTUS v3.3.1 40 , each with its own training pipeline. A second round of MAKER was run with the trained parameters in the est2genome = 0 and protein2genome = 0 mode. All other parameters were the same as in the first round, except that we used the 'est_gff' option with the genome-guided transcriptome assembly in place of the 'est' option.
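The selection of training gene models can be illustrated with a short Python sketch: keep models whose AED does not exceed 0.1, then draw 2,000 of them at random. This is not the maker2zff implementation, which applies further default criteria; the function name, the input format (a mapping from gene identifier to AED) and the fixed random seed are assumptions for illustration.

```python
import random


def select_training_models(aed_by_gene, max_aed=0.1, n_models=2000, seed=0):
    """Randomly pick up to n_models gene IDs with AED <= max_aed.

    aed_by_gene: dict mapping gene ID -> annotation edit distance (0 to 1,
                 lower means better agreement with the evidence).
    seed: hypothetical fixed seed so that the selection is reproducible.
    """
    candidates = sorted(g for g, aed in aed_by_gene.items() if aed <= max_aed)
    if len(candidates) <= n_models:
        return candidates
    rng = random.Random(seed)
    return sorted(rng.sample(candidates, n_models))
```

Sorting the candidates before sampling makes the result depend only on the seed and not on the input order.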

Gene function annotation
Protein sequences from the predicted gene models were searched against the KEGG 41 Table 4).

Whole-genome duplications in L. polyphemus and T. tridentatus
The synonymous substitution (Ks) distribution has been used to infer WGD in plants and vertebrates 55,56 . To look for evidence of whole-genome duplication in the Chinese and Atlantic horseshoe crabs, we identified the paralogues of each species and calculated the distribution of substitutions per synonymous site (Ks) for paralogous pairs with a Python script available online: https://github.com/EndymionCooper/KSPlotting. The main steps were as follows. 1) Sequence similarity was determined by an all-vs-all comparison of protein sequences, performed using BLASTP (BLAST+ v2.50) 57 with an E-value threshold of 1e-5. Gene pairs were retained if the shorter sequence was at least 50% of the length of the longer sequence and the alignment length was at least 50% of the shorter sequence. 2) Paralogue gene families were built through single-linkage clustering; in brief, genes connected by retained matches, directly or through other genes, were grouped into the same paralogue family. After the paralogues had been identified, all possible pairs of protein sequences in each paralogue family were aligned using MUSCLE 58 v3.8.31 with default parameters, and the amino acid alignments were then converted to the corresponding coding sequence (CDS) alignments. 3) The synonymous substitution (Ks) value of each paralogous gene pair was calculated using CODEML 59 . Only gene pairs with a Ks estimate of <3 were considered for further analysis. A paralogous gene family of n members derives from n-1 duplication events, but the number of possible pairwise Ks comparisons within a family is n × (n-1)/2, which could distort the inferred ages of duplication events. The Ks values of each paralogous group were therefore corrected to remove redundancy using a hierarchical clustering approach 60,61 , leaving Ks values that represent the duplication events in each paralogous group.
We also used the all-vs-all alignments with the reciprocal best BLAST hit (RBH) criterion to identify orthologous genes, and then applied KaKs Calculator 62 with the method of Yang and Nielsen 63 to estimate the synonymous substitution rates (Ks) of the orthologues.
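Steps 1 and 2 of the paralogue analysis can be sketched in Python as follows: a length filter on BLASTP hit pairs, single-linkage clustering of the retained pairs with a union-find structure, and the count of pairwise comparisons in a family. This is an illustrative sketch under stated assumptions, not the KSPlotting script; the function names and input formats are hypothetical.

```python
def keep_pair(len_a, len_b, aln_len, min_frac=0.5):
    """Length filter: the shorter protein must be at least half the longer one,
    and the alignment must span at least half of the shorter protein."""
    shorter, longer = sorted((len_a, len_b))
    return shorter >= min_frac * longer and aln_len >= min_frac * shorter


def single_linkage(pairs):
    """Group genes into families: two genes share a family if they are
    connected by a chain of retained pairs (single-linkage clustering)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    families = {}
    for gene in list(parent):
        families.setdefault(find(gene), set()).add(gene)
    return sorted((sorted(f) for f in families.values()), key=lambda f: f[0])


def pairwise_comparisons(n):
    """Number of pairwise Ks comparisons in a family of n members."""
    return n * (n - 1) // 2
```

A family of four genes, for instance, reflects three duplication events but yields six pairwise Ks values, which is why the redundancy correction described above is needed.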

Code availability
The software versions, settings and parameters are described below.

Data Records
The raw data from the whole-genome and RNA-seq sequencing were submitted to the National Center for Biotechnology Information (NCBI) (Data Citation 1), and more detailed information about the reads is shown in Data_Descriptor_Worksheets.xlsx. The final assembly was deposited at NCBI GenBank (Data Citation 2). The other files, such as the assembled contigs, scaffolds, de novo RNA-seq assembly, repeat annotation, gene prediction and gene function annotation, were uploaded to Figshare (Data Citation 3), and the files and their descriptions are listed in Supplementary Table 5.

Technical Validation
Genome assembly and gene prediction quality assessment
The completeness of the genome assembly and gene prediction was assessed using BUSCO 18 , based on evolutionarily informed expectations of gene content from near-universal single-copy orthologues selected from OrthoDB v9 64 . The completeness of the resulting assembly was comparable to that of other arthropods: BUSCO analysis showed that 96.2% and 0.8% of the 1,066 arthropod BUSCO genes (arthropoda_odb9, http://busco.ezlab.org/) were identified as complete and fragmented, respectively, whereas 3% of the BUSCO genes were missing from the assembled genome sequence (Supplementary Table 6). The completeness of the current gene prediction is also comparable to that of other arthropods: 95.4% and 2.3% of the 1,066 arthropod BUSCO genes (arthropoda_odb9, http://busco.ezlab.org/) were identified as complete and fragmented, respectively, whereas 2.3% of the BUSCO genes were missing from the current annotation (Supplementary Table 7). The de novo transcriptome assemblies of the three larval stages were aligned to the genome using BLAT 65 with default parameters to evaluate the coverage of gene regions. The alignment results indicated that the assembled genome of T. tridentatus covered 95.59–98.01% of the Unigenes, with 89.02–93.55% of the Unigenes having at least 90% coverage in one scaffold and 93.62–98.81% of the Unigenes having at least 50% coverage in one scaffold (Supplementary Table 8), suggesting that most gene regions are included in the current assembly.

Comparison with other horseshoe crab genomes
The current assembly of the T. tridentatus genome may be the most complete horseshoe crab genome to date. The contig and scaffold N50s of the current assembly were much longer than those of any other published horseshoe crab assembly (Supplementary Table 9), and the sequencing depth and assembled size were also much larger than those of other sources. The longest scaffold of T. tridentatus measures 18,230,544 bp, which is three times longer than that of L. polyphemus (PRJNA20489). T. tridentatus has 446 scaffolds of ≥ 1 Mb, whereas L. polyphemus has only 147 (Table 3). The Ks distribution of L. polyphemus exhibited peaks centred at 0.14 and 0.84, whereas the Ks distribution of T. tridentatus showed peaks centred at 0.16 and 1.16 (Fig. 3). This evidence indicates that the horseshoe crab lineage has undergone two rounds of WGD, which is consistent with the previous result 9 but with different peak positions. Meanwhile, the Ks distribution of the orthologous genes between T. tridentatus and L. polyphemus showed a peak centred at 0.20 (Fig. 3), which suggests that the last common ancestor of the extant horseshoe crabs lived after the ancient duplication but before the recent duplication.

Usage notes
All analyses were run on Linux systems, and the optimal parameters are given in the Code availability section.