Chromosome-level genome assembly of ridgetail white shrimp Exopalaemon carinicauda

Exopalaemon carinicauda, a eurythermal and euryhaline shrimp, contributes one third of the total biomass production of polyculture ponds in eastern China and is considered a potential ideal experimental animal for research on crustaceans. We generated a high-quality chromosome-level genome assembly of E. carinicauda by combining PacBio HiFi and Hi-C sequencing data. The total assembly size was 5.86 Gb, with a contig N50 of 235.52 kb and a scaffold N50 of 138.24 Mb. Approximately 95.29% of the assembled sequences were anchored onto 45 pseudochromosomes. BUSCO analysis revealed that 92.89% of 1,013 single-copy genes were highly conserved orthologs. A total of 44,288 protein-coding genes were predicted, of which 70.53% were functionally annotated. Given its high heterozygosity (2.62%) and large proportion of repeat sequences (71.49%), it is one of the most complex genome assemblies. This chromosome-scale genome will be a valuable resource for future molecular breeding and functional genomics research on E. carinicauda.


Background & Summary
The family Palaemonidae, including more than 1400 species in 181 genera, represents the largest family of the order Decapoda 1. Animals from this family are found in marine and freshwater environments in tropical to temperate regions worldwide. It includes several shrimps with high economic value, such as Macrobrachium rosenbergii, Macrobrachium nipponense and Exopalaemon carinicauda. The ridgetail white shrimp E. carinicauda is a eurythermal and euryhaline shrimp distributed over a wide geographical area throughout tropical, subtropical, and temperate coastal waters 2,3. It can survive in a multitude of environmental extremes, has a broad salinity tolerance of 2-44 and can survive in freshwater after domestication 4. It is also capable of inhabiting temperatures as low as −3 °C and as high as 39 °C 5,6. As one of the most commercially valuable pond-raised species of shrimp, E. carinicauda contributes one third of the total production of polyculture ponds in eastern China 7.
In addition to its important economic value in aquaculture, it is considered a potential ideal experimental animal for research on crustaceans because of its moderate size, transparent body (Fig. 1), short reproductive cycle, large eggs (diameters ranging from 0.57 to 1.08 mm) and ease of culturing and breeding in captive conditions 8. CRISPR/Cas9-mediated genome editing technology has been successfully used in E. carinicauda, which is the first time that gene editing has been realized in a decapod crustacean 9,10. However, the absence of genomic data limits the further application of gene editing in studying the molecular biology, cytobiology and genetics of crustaceans. Therefore, a high-quality reference genome is essential for understanding the molecular biology, genetics, breeding, ecology and adaptation of E. carinicauda.
A fragmented draft genome of E. carinicauda, containing 13,897,062 scaffolds (contig N50, 263 bp), has been assembled using Illumina short reads 11. Genome survey analysis indicated that E. carinicauda has a relatively large genome size of 5.73 Gb, which is at least twice as large as that of many decapod shrimps 12-14. In this study, an improved chromosome-level genome of E. carinicauda was assembled using the PacBio sequencing platform, Illumina paired-end sequencing, and high-throughput chromatin conformation capture (Hi-C) technology. Our previous studies suggested that the E. carinicauda karyotype is 2n = 90 15, similar to that of other Exopalaemon species 16. The final genome size was 5.86 Gb with a contig N50 length of 235.52 kb and a scaffold N50 length of 138.24 Mb. A total of 44,288 protein-coding genes were predicted in the genome of E. carinicauda. This chromosome-level genome assembly of E. carinicauda provides a valuable genomic resource for further genetic improvement and understanding of the functional genes and molecular mechanisms of E. carinicauda.

Methods
Animal materials and genome sequencing. A female shrimp was collected from Rizhao Haichen Aquatic Co., Ltd. The muscle tissue was collected for DNA extraction and library construction. Total genomic DNA was extracted using a cetyltrimethylammonium bromide method. For the genome survey, a 350 bp paired-end library was constructed according to the manufacturer's instructions (Illumina, San Diego, CA, USA) and sequenced on an Illumina NovaSeq 6000 platform. A total of 276.18 Gb of raw data were obtained, which covered approximately 54× of the estimated genome (Table 1).
For PacBio sequencing, a 15 kb library was constructed using the SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences, Menlo Park, CA, USA) and sequenced in circular consensus sequencing mode using a single 8 M SMRT Cell on the PacBio Sequel II platform (Pacific Biosciences). After filtering out the low-quality reads and sequence adapters, 3636.91 Gb of PacBio subreads were obtained, representing approximately 708× sequence coverage based on the estimated genome size (Table 1). Finally, 203.27 Gb of CCS reads were generated using SMRTLink 9.0, which covered approximately 40× of the estimated genome.
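The coverage figures quoted above follow from dividing the sequenced data volume by the estimated genome size. The sketch below is a minimal illustration of that arithmetic, assuming the 5.12 Gb k-mer estimate from the genome survey is the denominator; the function name is hypothetical. Under that assumption the Illumina and CCS figures round to the reported 54× and 40×, while the subread figure comes out at about 710×, close to but not identical with the reported 708×.

```python
def fold_coverage(data_gb: float, genome_gb: float) -> float:
    """Return sequencing depth as data volume divided by genome size."""
    if genome_gb <= 0:
        raise ValueError("genome size must be positive")
    return data_gb / genome_gb


# Assumption: the k-mer based estimate (5.12 Gb) is the genome size used
# for all coverage values in the text.
ESTIMATED_GENOME_GB = 5.12

datasets_gb = {
    "Illumina survey reads": 276.18,
    "PacBio subreads": 3636.91,
    "PacBio CCS reads": 203.27,
}

for name, gb in datasets_gb.items():
    print(f"{name}: {fold_coverage(gb, ESTIMATED_GENOME_GB):.1f}x")
```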
For the construction of the Hi-C library, DNA was fixed with 4% formaldehyde solution and digested with the 4-cutter restriction enzyme MboI. The digested fragments were labeled with biotin-14-dCTP, then the cross-linked fragments were subjected to blunt-end ligation. The library was sequenced on the Illumina NovaSeq 6000 platform, and approximately 552.65 Gb of Hi-C clean reads were generated, covering approximately 108× of the estimated genome (Table 1).
Genome survey. The genome size and heterozygosity were estimated using the k-mer method before genome assembly 17. The k-mer distribution was calculated from Illumina short reads using Jellyfish 18 with k = 17. The heterozygosity ratio was estimated with GenomeScope 19 (https://github.com/schatzlab/genomescope). The genome size of E. carinicauda was estimated to be approximately 5.12 Gb, with 84.74% repetitive sequences and a heterozygosity of 2.62% based on the 17-mer analysis (Fig. 2), suggesting that the genome of E. carinicauda is complex.
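The principle behind a k-mer based size estimate can be sketched as follows: the total number of k-mer occurrences divided by the depth of the main peak in the k-mer histogram approximates the genome size. This is a deliberately simplified illustration, not the model that GenomeScope fits; the function name, the `min_depth` error cutoff and the toy histogram are assumptions made for the example. For a highly heterozygous genome the histogram has separate heterozygous and homozygous peaks, which this naive peak picking does not distinguish.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Naive k-mer genome size estimate.

    histogram maps k-mer depth -> number of distinct k-mers seen at that depth.
    Depths below min_depth are treated as sequencing errors and ignored
    (the cutoff is an assumption of this sketch).
    Returns (estimated_size, peak_depth).
    """
    kept = {d: c for d, c in histogram.items() if d >= min_depth}
    if not kept:
        raise ValueError("no k-mers at or above min_depth")
    # Depth with the most distinct k-mers is taken as the main coverage peak.
    peak_depth = max(kept, key=kept.get)
    # Total k-mer occurrences = sum over depths of depth * distinct count.
    total_kmers = sum(d * c for d, c in kept.items())
    return total_kmers / peak_depth, peak_depth


# Toy histogram (made-up numbers, for illustration only).
toy = {1: 1000, 2: 300, 9: 10, 10: 100, 11: 10}
size, peak = estimate_genome_size(toy)
print(size, peak)
```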
Chromosome-level genome assembly. The initial genome was assembled from HiFi reads using Peregrine (v0.1.6.1) (https://github.com/cschin/peregrine). A modified "best overlap graph" strategy was used to obtain the contig assembly from the overlap graph. Contig overlaps were removed from the assembled contig sequences using Purge_dups (https://github.com/dfguan/purge_dups). De novo assembly of the PacBio sequences yielded a preliminary assembly of 5.86 Gb, containing 47,421 contigs with a contig N50 length of 235.28 kb, a maximum length of 3,038,493 bp and a GC content of 34.79% (Table 1).
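For readers unfamiliar with the N50 statistic reported here, the following is a minimal sketch of its standard definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. The function name and the toy lengths are illustrative and are not taken from the assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    # Walk from longest to shortest until half of the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Toy example: total length is 32, and the two longest contigs (8 + 8)
# already cover half of it.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```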
Chromosome-level assembly of E. carinicauda was conducted using Hi-C technology. Juicer (v1.6.2) 20 and 3D-DNA (v180922) 21 were used to obtain the chromosome-level whole genome assembly. The filtered Hi-C reads were aligned to the initial draft genome using Juicer (v1.6.2). Only uniquely mapped and valid paired-end reads were used for the assembly with 3D-DNA. Juicebox (v1.9.8) was used to manually order the scaffolds according to the chromosomal interaction heatmap, generating a more precise chromosome-level genome of E. carinicauda 22. Contact maps were visualized using HiCExplorer (v3.3) 23. The diploid chromosome number was taken to be 90 (2n = 90), based on karyological observations of E. carinicauda chromosomes in our previous study 15. The contigs were ultimately clustered into 45 pseudochromosomes, with a scaffold N50 length of 138.24 Mb. The total length of the 45 pseudochromosomes was 5.58 Gb (95.29% of the assembly) (Fig. 3a,b), with individual lengths ranging from 46.25 Mb to 338.48 Mb. The total length of the unplaced scaffolds was 275.86 Mb (Table 2).
The quality of the final chromosome-level genome assembly was assessed using three methods. First, we aligned the Illumina DNA short reads obtained in our previous study to the assembled genome using BWA (v0.7.15) 24 and found that approximately 99.00% of the reads could be mapped to our assembly. Second, read depth and GC content in 10 kb windows were used to evaluate the assembly and to check for significant GC bias or sample contamination; the results showed that the assembled genome was free of contamination (Fig. 4). Finally, genome assembly completeness was further evaluated using benchmarking universal single-copy orthologs (BUSCO, v5.2.2) with the arthropoda_odb10 database 25. The results showed that 92.89% of the 1013 single-copy genes were highly conserved orthologs (88.75% complete, 4.15% fragmented, and 7.11% missing) (Table 3).
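The windowed GC calculation used in the second check can be illustrated with a short sketch. It assumes non-overlapping windows and excludes ambiguous bases such as N from the denominator; neither detail is stated in the text, and the function name is hypothetical. In practice the per-window GC values would be plotted against per-window read depth, as in Fig. 4.

```python
def gc_by_window(sequence, window=10_000):
    """Return the GC fraction of each non-overlapping window.

    Only A, C, G and T are counted; a window with no called bases
    (for example, all N) yields None.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    seq = sequence.upper()
    fractions = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        at = chunk.count("A") + chunk.count("T")
        called = gc + at
        fractions.append(gc / called if called else None)
    return fractions


# Toy sequence with a 4 bp window, for illustration only.
print(gc_by_window("GCGCATATNNNN", window=4))
```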
Compared to the published genome of E. carinicauda 11, our assembled genome is of significantly improved quality and integrity. The contig N50 increased nearly 900-fold, from 263 bp to 235,277 bp, and the scaffold N50 increased from 816 bp to 138,242,434 bp. Meanwhile, the proportion of complete orthologues increased from 43.44% to 88.75% according to the BUSCO assessment.
Repetitive and non-coding gene prediction. To detect repeat elements in the E. carinicauda genome, de novo and homology-based strategies were combined using multiple methods. Miniature inverted-repeat transposable elements (MITEs) were identified using MITE-Hunter (v1.0) 26 for de novo annotation. Long terminal repeat sequences (LTRs) were detected using LTRharvest 27 and LTR_Finder (v1.07) 28, and the predictions of these two programs were integrated using LTR_retriever (v2.8.2) 29. RepeatMasker (v4.1.0) 30 was used for homology-based alignment of the E. carinicauda genome sequence against the RepBase database (http://www.girinst.org/repbase). RepeatMasker was then used to mask the repetitive sequences obtained by the above methods, and RepeatModeler (v2.0) 31 was used for de novo identification of other repetitive sequences in the repeat-masked genome. Ultimately, we identified approximately 4.19 Gb of repetitive sequences, accounting for approximately 71.49% of the assembled genome, among which 9.97% were tandem repeat sequences. Among these repetitive sequences, LTRs (42.52%) accounted for the highest proportion of the assembly, followed by DNA transposons (10.81%) and LINEs (3.33%) (Table 4).
The predicted protein-coding genes were functionally annotated using BLAST against NR, SwissProt, eggNOG, InterPro, GO and KEGG 45. The functional annotation results from these databases were then merged. Finally, 70.53% of the total predicted genes were assigned at least one functional annotation (Table 6).
The final chromosome-level assembled genome file has been uploaded to the GenBank database under the accession JAZBEV000000000 57.

Technical Validation
To evaluate the integrity and accuracy of the genome assembly, the completeness of the final genome assembly was assessed using BUSCO (v5.2.2) and the arthropoda_odb10 database 25. The analysis showed that 92.89% of the 1013 single-copy genes were highly conserved orthologs (88.75% complete, 4.15% fragmented, and 7.11% missing). When the Illumina sequencing reads (PRJNA471201) 3 were aligned to the genome using BWA (v0.7.15) 24, the read-mapping rate was 99.00%, indicating high mapping efficiency. Together, these results indicate that we obtained a high-quality genome of E. carinicauda.
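The reported BUSCO percentages can be checked for internal consistency with a small calculation. The gene counts used below (899 complete, 42 fragmented, 72 missing) are not taken from the paper's tables; they are back-calculated here from the stated percentages and the 1013-gene total, and should be read as an assumption. Under that assumption, the sketch also shows why the combined figure is 92.89% rather than the 92.90% obtained by adding the two rounded percentages.

```python
def busco_percentages(complete, fragmented, missing):
    """Convert BUSCO category counts to percentages rounded to 2 decimals."""
    total = complete + fragmented + missing
    if total <= 0:
        raise ValueError("counts must sum to a positive total")

    def pct(n):
        return round(100 * n / total, 2)

    return {
        "complete": pct(complete),
        "fragmented": pct(fragmented),
        "missing": pct(missing),
        # Rounded once from the summed counts, not from rounded percentages.
        "complete_plus_fragmented": pct(complete + fragmented),
    }


# Assumed counts, inferred from the percentages quoted in the text.
print(busco_percentages(899, 42, 72))
```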

Fig. 1 A lateral full-body view of the sequenced E. carinicauda.

Fig. 3 Genome assembly of E. carinicauda. (a) Hi-C interaction heatmap of the assembled chromosomes. A deeper colour represents stronger interaction between contigs. (b) Characterization of the assembled genome. a, Physical map of E. carinicauda pseudochromosomes (Mb scale); different colours represent different chromosomes. b, Proportional distribution of repeated sequences in 1 Mb windows. c, Gene density represented by the number of genes in 1 Mb windows. d, GC content represented by the percentage of G/C bases in 1 Mb windows.

Fig. 4 GC content and depth distribution. The horizontal axis represents the percentage of GC content, and the vertical axis represents the average sequencing depth.

Table 1. Genome assembly statistics of E. carinicauda.

Table 6. Statistical results of gene function annotation.