First draft genome for the sand-hopper Trinorchestia longiramus

Crustacean amphipods are important trophic links between primary producers and higher consumers. Although most amphipods occur in or around aquatic environments, the family Talitridae is the only family found in terrestrial and semi-terrestrial habitats. The sand-hopper Trinorchestia longiramus is a talitrid species often found in the sandy beaches of South Korea. In this study, we present the first draft genome assembly and annotation of this species. We generated ~380.3 Gb of sequencing data assembled in a 0.89 Gb draft genome. Annotation analysis estimated 26,080 protein-coding genes, with 89.9% genome completeness. Comparison with other amphipods showed that T. longiramus has 327 unique orthologous gene clusters, many of which are expanded gene families responsible for cellular transport of toxic substances, homeostatic processes, and ionic and osmotic stress tolerance. This first talitrid genome will be useful for further understanding the mechanisms of adaptation in terrestrial environments, the effects of heavy metal toxicity, as well as for studies of comparative genomic variation across amphipods.

www.nature.com/scientificdata www.nature.com/scientificdata/ expansion of particular gene families, including those related to response to stress, homeostatic process, transmembrane transport, and signal transduction. A phylogenetic analysis with related amphipod and arthropod species suggests that T. longiramus diverged from the H. azteca during the Late Cenozoic era. This first talitrid genome will be useful for further understanding the mechanisms of adaptation in terrestrial environments, the effects of heavy metal toxicity, as well as for studies of comparative genomic variation across amphipods.

Methods
Sample collection and extraction of DNa and rNa. T. longiramus samples were collected from the coast (37°41′29″N, 129°2′2.7″E) of South Korea. They were captured by hand from exposed and sheltered sandy beaches. Samples were preserved immediately in 95% ethanol for genome sequencing and stored in liquid nitrogen for RNA extraction.
DNA was extracted from a pool of seven individuals using a conventional phenol-chloroform protocol 29 . The purified DNA was resuspended in Tris-EDTA (TE) buffer (TE; 10 mM Tris-HCl, 1 mM EDTA, pH 7.5). For RNA isolation, several frozen whole bodies were mortar-pulverized in liquid nitrogen. The purified RNA was extracted in lysis buffer, containing 35 mM EDTA, 0.7 M LiCl, 7.0% SDS, and 200 mM Tris-Cl (pH 9.0), following the protocol by Woo et al. 30 . The purified RNA was eluted in DEPC-treated water and stored at −20 °C. k-mer distribution and genome size estimation. Prior to estimating the genomic size, we processed raw reads as follows. We discarded low-quality (<Q20) PE reads and those that contained the Truseq index and universal adapters. We then merged the high-quality PE reads using FLASH 31 , with default options to avoid double counting of overlapping reads. The estimated genome size of T. longiramus was ~1.116 Gb based on a k-mer distribution (K = 17) analysis run with JELLYFISH 32 . The main peak exists at k-mer depth 42, which was used for genome size estimation (Fig. 1).

Short and long
Genome assembly. Assembly, adapters, low-quality reads, and uncalled bases were trimmed from PE and MP raw reads using Platanus_trim and Plantanus_internal_trim, respectively. Initial assembly was performed with Platanus 33 based on automatically optimized multiple k-mer values. We executed individual commands www.nature.com/scientificdata www.nature.com/scientificdata/ "assemble, " "scaffold, " and "gap_close" in the Platanus assembler suite, successively. For the "assemble" stage, we assigned the maximum memory usages as 2,048 G, but all the other stages were executed with default options. Scaffolds larger than 1,000 bp in length scaffolded using trimmed PE and MP reads in SSPACE 34 (Fig. 2). Finally, we filtered out two bacterial sequences with more than 500 BLASTN bit scores of 90% alignment coverage identified in MEGAN 35 . We re-confirmed using BLASTX with a non-redundant database in DIAMOND 36 . Table 3 shows the assembly statistics for Platanus, SSPACE, and the final assembly.
repeat annotation. To annotate repetitive elements, we first identified tandem repeats using the Tandem Repeats Finder 37 . Transposable elements (TEs) were identified by combining de novo (RepeatModeler) 38 and homology-based approaches (Repbase 39 , RepeatMasker 40 , and RMBlast 40 ). TEs accounted for 20.35% of the genome, with tandem repeats accounting for the largest portion (6.18%) ( Table 4).     www.nature.com/scientificdata www.nature.com/scientificdata/ Gene prediction and annotation. The protein-coding genes were predicted by combining ab initio and homology-based gene prediction methods (Fig. 2). For the ab initio gene prediction, BRAKER 41 predicted 67,698 genes, which incorporated outputs from GeneMark-ET 42 and AUGUSTUS 43 . GeneMark-ET predicts genes with unsupervised training, whereas AUGUSTUS predicts genes with supervised training based on intron and protein hints. We generated two hint files from an Illumina RNA-seq and PacBio ISO-seq. Tophat 44 was used to align RNA-seq reads to the repeat-masked genome assembly. We proceeded with Iso-seq to obtain the protein sequences, as described in Minoche et al. 45 : (1) run LSC 46 to correct errors for full-length transcripts, (2) align the corrected transcripts to the genome using GMAP 47 , and (3) generate gene models from aligned sequences and extract the protein sequence from the generated gene model using Transdecoder 48 . We obtained 1,573 protein sequences, which were used to generate protein hints for AUGUSTUS by running Exonerate 49 . To remove incomplete gene sequences from genes predicted by BRAKER, we filtered out the predicted coding sequences (CDSs) using the following two criteria: 1) CDSs that contained premature stop codons and (2) CDSs that were not supported by hints. Finally, a total of 23,985 protein-coding genes were estimated by ab initio prediction (Table 5).
For the homology gene predictions, we searched the assembly of T. longiramus against Daphnia pulex, Drosophila melanogaster, Folsomia candida, H. azteca, Lepeophtheirus salmonis, Parasteatoda tepidariorum, P. hawaiensis, and arthropoda in orthoDB using TBLASTN 50 with an E-value cutoff of 1E-5. Matching sequences were clustered using GenBlastA 51 , and only best-matched regions were retained. Then, gene models were predicted using Exonerate 49 . Predicted gene sequences that did not meet the above criteria were discarded. As a result, a total of 9,913 genes were predicted by a homology-based approach (Table 5).
Finally, we combined the two outputs by placing homology predictions to ab initio prediction only when there is no conflict. As a result, 26,080 protein-coding genes were predicted for the T. longiramus draft genome (Table 5). Gene Ontology for the predicted genes were annotated using InterProScan with various databases 52

Data records
All DNA and RNA raw reads have been deposited in the NCBI SRA (Table 1) under the SRA study accession SRP199018 62 . The whole genome shotgun sequencing project was deposited in GenBank under accession VCRD01000000 63 . In addition, the assembled genome was submitted to NCBI Assembly and is available with accession no. GCA_006783055.1 64

technical Validation
DNa and rNa sample quality. DNA quality was assessed using Nanodrop, 1% agarose gels, Qubit fluorometer, and the Qubit HS DNA assay reagents. The RNA integrity was assessed using Nanodrop and an Agilent 2100 Bioanalyzer electrophoresis system (Agilent, Santa Clara, CA, USA). Genome assembly and gene prediction quality assessment. The length statistics of the genome assembly were assessed by QUAST 65 . The total assembly length is 0.89 Gb, which corresponds to 79.43% of the estimated genome size. The final scaffold N50 is 120.57 kb (Table 3). Genome completeness was evaluated using BUSCO 66 , with Arthropoda conserved genes databases. The genome assembly, after removing bacteria sequences from SSPACE, revealed a complete BUSCO value of 88.3%. However, in predicted genes, BUSCO completeness was higher (89.9%) ( After orthologous gene clustering, 490 single-copy protein sequences were aligned using MUSCLE 68 . Low alignment quality regions were filtered using trimAl 69 . A phylogenetic tree was constructed using RAxML 70 , with the PROTGAMMAJTT model (100 bootstrap replicates). Divergence time was calculated using MEGA7 71 with the Jones-Taylor-Thornton model and the previously determined topology (Fig. 3a). Calibration times of Parasteatoda-Drosophila divergence (601 MYA) and Strigamia-Drosophila divergence (583 MYA) were taken  Table 6. BUSCO assessment of genome assembly and gene prediction. www.nature.com/scientificdata www.nature.com/scientificdata/ from the TimeTree database 72 . We found that T. longiramus diverged from H. azteca during the Early Cenozoic era, approximately 55 million years ago.
A gene expansion and contraction analysis was conducted using the CAFE program 73 with the estimated phylogenetic information. A total of 122 gene families have expanded, and 388 gene families were contracted in T. longiramus. Fisher's exact test (p-value ≤ 0.05) was used to identify functionally enriched categories among expanded genes relative to the "genome background, " as annotated by Pfam (Supplementary Table 1). We observed that gene families associated with transferring glycosyl and acyl groups, ATPase activity, response to stress, homeostatic process, and transmembrane transport have expanded. Among transmembrane transport activities, we found that sodium/hydrogen exchanger genes were responsible for a wide range of cellular functions, such as cation movement, homeostasis, regulation of pH, and tolerating ionic and osmotic stress 74 . We also found several genes, such as ABC transporters responsible for efflux toxicants out of the cells 75 , sodium-independent organic anion transporter required for uptake of organic amphipathic compounds, and xenobiotic drugs 76 .
A Venn diagram of orthologous gene clusters was drawn on the basis of the protein sequences from T. longiramus (26,080 proteins) and two amphipods: H. azteca (17,509 proteins) and P. hawaiensis (28,617 proteins) (Fig. 3b). T. longiramus has 327 unique orthologous gene clusters found among these three genomes. Among these unique gene clusters, the top three gene clusters are DNA-and RNA-mediated transposition, iron ion binding, and DNA metabolic process. Several unique genes also were found in expanded gene families mentioned above (Supplementary Table 1).

Usage Notes
All analyses were conducted on Linux systems, and optimal parameters are given in the Code availability section.

Code availability
The software versions, settings, and parameters are described in Table 7. If not mentioned otherwise, the command line at each step was executed using default settings.  Table 7. A list of software and parameters used for genome analysis.