Background & Summary

Crustaceans are a diverse and ancient group of arthropods1. They are essential components of marine and freshwater environments and are also interesting models for the study of evolutionary and developmental biology. However, because crustacean genomes are highly complex, assembling them completely and accurately is difficult, let alone at the chromosome level2.

Cherax quadricarinatus, also known as the red claw crayfish, is a large tropical freshwater crustacean of significant commercial interest for global aquaculture3. Intersexuality appears relatively widespread among gonochoristic crustaceans and has been reported in several crayfish species4. In red claw crayfish, intersex individuals undergo a dramatic morphological and physiological sex shift, which makes this species a fascinating model for studying the mechanisms underlying sex determination and differentiation in crustaceans. Although a genome of this species has been reported previously, the assembly is incomplete and fragmented (assembled genome size, 3.24 Gb; contig N50, 33 kb), which still limits in-depth studies5. Here, we de novo assembled a chromosome-level genome of red claw crayfish with an assembled genome size of 5.26 Gb and a contig N50 of 144,316 bp. This high-quality genome will enrich the genomic resources available for crustaceans and provide basic data for further genome-wide selective breeding.

Methods

Sample collection and genomic sequencing

All samples used in this study were taken from a healthy adult male red claw crayfish farmed at Honghai Co., LTD., Zhejiang, China. Fresh muscle and haemolymph were used for whole-genome sequencing and Hi-C sequencing, respectively. Seven tissues (muscle, intestine, eyestalk, hepatopancreas, gills, stomach, and antennal gland) were used for transcriptomic sequencing. Isolation of DNA and RNA, library construction and genomic sequencing were carried out according to the protocols at https://www.protocols.io/widgets/doi?uri=dx.doi.org/10.17504/protocols.io.bs8inhue.

For whole-genome sequencing (WGS), the genomic DNA was sonicated into ~250 bp fragments that were used to build the 100 bp paired-end (PE100) sequencing library. The library was then sequenced on the BGISEQ-500 platform, generating 280.51 Gb of raw data, which covered ~58X of the estimated genome (Table 1).

Table 1 Statistics of sequencing data.

For PacBio Continuous Long Reads (CLR) sequencing, seven sequencing libraries were constructed using ~20 kb high-quality DNA fragments. All libraries were sequenced on the PacBio Sequel II platform, which generated 568.55 Gb of raw data with a read N50 of 17,393 bp (Table 1).

To construct the Hi-C libraries, DNA was fixed with formaldehyde solution, isolated from nuclei and digested with MboI, and the digested fragments were labeled with biotinylated nucleotides. Eight libraries were sequenced on the BGISEQ-500 platform and produced a total of 542.71 Gb of raw data, which covered ~105X of the estimated genome (Table 1).

Seven RNA libraries, one per tissue, were constructed according to the protocols and sequenced on the BGISEQ-500 platform, generating a total of 136.96 Gb of raw data (Table 1).

Genome survey

Raw PE100 reads were first filtered by SOAPnuke (v1.6.5)6 with the parameters “-M 1 -d -A 0.4 -n 0.05 -l 10 -q 0.4 -Q 2 -G -5 0”, and 240 Gb of clean data were retained (Table 1). Jellyfish (v2.2.6)7 was then used to count 17-mers, and GenomeScope8 was used to estimate the genome size, heterozygosity and repeat content, which were 4.74 Gb, 0.86% and 85.6%, respectively (Fig. 1a).
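As a rough illustration of the k-mer approach, genome size can be approximated as the total number of k-mer occurrences divided by the depth of the main peak in the k-mer depth histogram. The sketch below assumes a two-column histogram (depth, number of distinct k-mers) of the kind a k-mer counter can export; the function names, the min_depth cutoff for error k-mers and the toy histogram are illustrative assumptions. GenomeScope fits a more elaborate mixture model, so this is not the procedure that produced the estimates above.

```python
def parse_histo(text):
    """Parse whitespace-separated 'depth count' lines into integer pairs."""
    return [
        tuple(int(x) for x in line.split())
        for line in text.splitlines()
        if line.strip()
    ]


def estimate_genome_size(histogram, min_depth=5):
    """Approximate genome size from a k-mer depth histogram.

    histogram: iterable of (depth, count) pairs.
    min_depth: depths below this are treated as sequencing errors
               (an assumed cutoff, chosen here for illustration only).
    Returns (estimated_size, peak_depth).
    """
    usable = [(d, c) for d, c in histogram if d >= min_depth]
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # The main peak is taken as the depth shared by the most distinct k-mers.
    peak_depth = max(usable, key=lambda dc: dc[1])[0]
    # Total k-mer occurrences = sum over depths of depth * distinct k-mers.
    total_kmers = sum(d * c for d, c in usable)
    return total_kmers / peak_depth, peak_depth


# Toy histogram (hypothetical values, not data from this study).
toy = parse_histo("1 900\n2 300\n9 40\n10 100\n11 40\n20 5\n")
size, peak = estimate_genome_size(toy)
print(f"peak depth = {peak}, estimated size = {size:.0f} k-mers")
```

In a highly heterozygous or repetitive genome the tallest peak may not be the homozygous one, which is one reason a model-based estimate is preferred in practice.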

Fig. 1 Genome assembly of the red claw crayfish. (a) The 17-mer analysis of the genome. (b) The karyotypic analysis. The karyotype formula of the male is n = 100 = 36 m + 33 sm + 14 st + 17 t. (c) The linear regression analysis between sequence length and physical length of chromosomes. (d) Genomic features.

Chromosome karyotyping

The number and lengths of chromosomes in red claw crayfish were determined by karyotyping 15 adult males, following a published pipeline9. Chromosomes were measured using the Adobe Photoshop CS6 measurement tools at a magnification of 600×. The chromosome pairs were classified following the nomenclature of Levan (1964)10 into m = metacentric (long arm/short arm ratio (r) = 1–1.7), sm = submetacentric (r = 1.7–3), st = subtelocentric (r = 3–7), and t = acrocentric (r > 7). The karyotype formula of the male red claw crayfish is n = 100 = 36 m + 33 sm + 14 st + 17 t (Fig. 1b), and the arm length data are listed in Supplementary Table 1.
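The arm-ratio classification can be sketched as a small function. This is a minimal illustration, not the measurement pipeline used here: the function names and the toy arm lengths are hypothetical, the fourth class (r > 7) is labelled t as in the karyotype formula, and the handling of values that fall exactly on a boundary (assigned to the lower class) is an assumption because the ranges in the text share their endpoints.

```python
from collections import Counter


def classify_chromosome(long_arm, short_arm):
    """Return the arm-ratio class (m, sm, st or t) for one chromosome."""
    if long_arm < 0 or short_arm < 0:
        raise ValueError("arm lengths must be non-negative")
    if long_arm < short_arm:
        long_arm, short_arm = short_arm, long_arm
    if short_arm == 0:
        return "t"  # no measurable short arm: ratio is unbounded
    r = long_arm / short_arm
    if r <= 1.7:   # boundary values go to the lower class (assumption)
        return "m"
    if r <= 3:
        return "sm"
    if r <= 7:
        return "st"
    return "t"


def karyotype_formula(arm_pairs):
    """Summarise classes as a string such as '1 m + 1 sm + 1 st + 1 t'."""
    counts = Counter(classify_chromosome(l, s) for l, s in arm_pairs)
    parts = [f"{counts[c]} {c}" for c in ("m", "sm", "st", "t") if counts[c]]
    return " + ".join(parts)


# Toy (long arm, short arm) measurements in arbitrary units; hypothetical values.
toy_arms = [(5.0, 4.0), (6.0, 3.0), (8.0, 2.0), (9.0, 1.0)]
formula = karyotype_formula(toy_arms)
print(formula)
```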

Genome assembly

Raw PacBio CLR reads longer than 5 kb were retained and corrected by Canu (v1.5)11, and the corrected reads were used to assemble the draft genome with Wtdbg212 with the parameters “-p 21 -E 2 -S 4 -s 0.05 -L 5000 -X 40”. The draft genome was further polished by Pilon13 using clean PE100 reads with default parameters, giving an assembly with a size of 5.26 Gb and a contig N50 of 144.33 kb (Table 2).
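For readers unfamiliar with the metric, contig N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. A minimal sketch follows; the function name and the toy contig lengths are illustrative and are not taken from the assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or read) lengths."""
    if not lengths:
        raise ValueError("at least one length is required")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running sum to half the
        # total assembly length defines the N50.
        if running >= half_total:
            return length


# Toy contig lengths in bp (hypothetical values).
toy_contigs = [2, 3, 4, 5, 6, 10]
print(n50(toy_contigs))
```

The same function applies to read lengths, which is how a read N50 such as the one quoted for the PacBio data is defined.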

Table 2 Summary of the genome assembly of red claw crayfish.

Based on the polished genome, 84.34 Gb of Hi-C data passed quality control by HiC-Pro (v2.8.0)14 and were then used for chromosomal reconstruction with Juicer (v1.5)15 and 3D-DNA (3D de novo assembly)16. To obtain more precise chromosomes, we made manual adjustments in Juicebox17 according to the chromosomal interaction heatmap (Fig. 2). Finally, a total of 4.70 Gb of sequence was anchored to 100 chromosomes, of which the longest is 142.95 Mb and the shortest is 18.54 Mb (Supplementary Table 2). Linear regression of the karyotyping and assembly results showed a high correlation (R² = 0.9874) between the physical length and sequence length of the 100 chromosomes (Fig. 1c), indicating a high-quality assembly of the crustacean genome with the largest number of chromosomes reported to date.
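The agreement between physical and sequence lengths can be quantified as the coefficient of determination of a simple linear regression, which for a single predictor equals the squared Pearson correlation. The sketch below is a minimal illustration under that assumption; the function name and the toy values are hypothetical and do not reproduce the measurements in Supplementary Tables 1 and 2.

```python
def r_squared(x, y):
    """Coefficient of determination for a simple linear regression of y on x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary")
    # For one predictor, R^2 is the squared Pearson correlation coefficient.
    return (sxy * sxy) / (sxx * syy)


# Toy data: relative physical lengths vs. sequence lengths (hypothetical values).
physical = [1.0, 2.0, 3.0, 4.0]
perfect = [2.0, 4.0, 6.0, 8.0]
noisy = [2.1, 3.9, 6.2, 7.8]
print(r_squared(physical, perfect), r_squared(physical, noisy))
```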

Fig. 2 The chromosome matrix heatmap.

Repeat annotation

Tandem repeats were identified by TRF (v4.09)18, and repetitive sequences were predicted by aligning the genome to the Repbase library with RepeatMasker (v3.3.0) and RepeatProteinMask (v3.3.0)19. In addition, a de novo library of transposable elements (TEs) was constructed with RepeatModeler (v1.0.8)20 (Table 3). Taken together, these results showed that the red claw crayfish genome contains 78.69% repetitive sequences, among which TEs were the most abundant (3,482 Mb) (Fig. 3, Table 4). The proportion of TEs in crayfish was generally much higher than in other decapod crustaceans.

Table 3 Summary of repetitive sequences.
Fig. 3 Composition of the major TEs among 11 crustacean species.

Table 4 Summary of different TE repeat sequences.

Gene prediction

For homology-based gene prediction, the protein sequences of six crustacean species, including Cherax quadricarinatus (previous version), Eriocheir sinensis, Hyalella azteca, Macrobrachium nipponense, Penaeus vannamei, and Procambarus virginalis, were aligned to the genomic sequence of red claw crayfish using BLAST20 and Genewise21 with default parameters. Augustus (v3.2.3)22 and Genscan23 were used for de novo gene prediction24. RNA reads were mapped to the genome by HISAT2 (v2.1.0)25, and gene structures were predicted by Stringtie (v1.2.2)26. In parallel, the transcriptome was de novo assembled by Trinity (v2.1.1)27, and splicing variants were identified by PASApipeline (v2.4.1)28. EVidenceModeler (v1.1)29 was applied to integrate the above evidence, and a total of 20,460 protein-coding genes were predicted, with an average gene length of 40,182.55 bp and an average of 6.5 exons per gene (Tables 5, 6).
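Summary statistics such as the gene count, average gene length and exons per gene can be derived from a GFF3 annotation. The sketch below is a simplified illustration: it assumes standard nine-column GFF3 lines, one transcript per gene (so that exons per gene is simply the number of exon features divided by the number of gene features), and a hypothetical toy annotation; it is not the script used to produce Table 5.

```python
def gene_stats(gff3_text):
    """Return (gene_count, mean_gene_length, mean_exons_per_gene)."""
    gene_lengths = []
    exon_count = 0
    for line in gff3_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        fields = line.split("\t")
        if len(fields) < 9:
            continue  # not a feature line
        feature_type = fields[2]
        start, end = int(fields[3]), int(fields[4])
        if feature_type == "gene":
            gene_lengths.append(end - start + 1)  # GFF3 coordinates are 1-based, inclusive
        elif feature_type == "exon":
            exon_count += 1
    if not gene_lengths:
        raise ValueError("no gene features found")
    n = len(gene_lengths)
    return n, sum(gene_lengths) / n, exon_count / n


# Toy annotation (hypothetical): two genes with two exons and one exon.
toy_rows = [
    ("chr1", "EVM", "gene", 1, 1000),
    ("chr1", "EVM", "exon", 1, 400),
    ("chr1", "EVM", "exon", 601, 1000),
    ("chr1", "EVM", "gene", 2001, 2500),
    ("chr1", "EVM", "exon", 2001, 2500),
]
toy_gff3 = "##gff-version 3\n" + "\n".join(
    "\t".join([seqid, source, ftype, str(start), str(end), ".", "+", ".", "ID=x"])
    for seqid, source, ftype, start, end in toy_rows
)
stats = gene_stats(toy_gff3)
print(stats)
```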

Table 5 Statistical results of gene structure prediction.
Table 6 BUSCO evaluation of gene annotation in red claw crayfish.

These genes were then functionally annotated by BLAST searches against the NCBI non-redundant protein (NR), TrEMBL, Gene Ontology (GO), SwissProt, and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases. In total, 16,859 genes, accounting for 82.40% of all predicted genes, were annotated in at least one public functional database (Table 7).
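The "at least one database" figure is the size of the union of the per-database hit sets, divided by the total number of predicted genes. A minimal sketch, with hypothetical gene identifiers and hit sets, together with an arithmetic check of the percentage reported above:

```python
def annotation_rate(all_genes, hits_by_database):
    """Return (annotated_count, percentage) for genes hit in >= 1 database."""
    gene_set = set(all_genes)
    if not gene_set:
        raise ValueError("all_genes must not be empty")
    # Union of hits across databases, restricted to the predicted gene set.
    annotated = set().union(*hits_by_database.values()) & gene_set
    return len(annotated), 100 * len(annotated) / len(gene_set)


# Toy example (hypothetical gene IDs and hits).
toy_genes = ["g1", "g2", "g3", "g4", "g5"]
toy_hits = {
    "NR": {"g1", "g2"},
    "SwissProt": {"g2", "g3"},
    "KEGG": {"g1"},
}
count, percent = annotation_rate(toy_genes, toy_hits)
print(count, percent)

# Check of the reported figure: 16,859 annotated genes out of 20,460.
reported = round(100 * 16859 / 20460, 2)
print(reported)
```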

Table 7 Summary of gene annotation in red claw crayfish.

tRNAscan-SE30 was used to annotate tRNAs based on features such as isotype, anticodon, and tRNAscan-SE bit score. The rRNA sequences were annotated using homologous references from closely related species. MiRNAs and snRNAs were predicted by INFERNAL31 based on the covariance models of the Rfam database. In total, 6,954 non-coding RNAs were predicted, including 25 miRNA, 1,448 rRNA, 5,023 tRNA and 458 snRNA genes (Table 8).

Table 8 Statistics of annotated non-coding RNAs.

Data Records

The genomic WGS sequencing data were deposited in the SRA at NCBI under accessions SRR22412649 (ref. 32) and SRR22412641 (ref. 33).

The genomic PacBio sequencing data were deposited in the SRA at NCBI under accession SRR22412654 (ref. 34).

The transcriptomic sequencing data were deposited in the SRA at NCBI under accessions SRR22412651 (ref. 35), SRR22412652 (ref. 36), SRR22412653 (ref. 37), SRR22412637 (ref. 38), SRR22412638 (ref. 39), SRR22412639 (ref. 40) and SRR22412640 (ref. 41).

The Hi-C sequencing data were deposited in the SRA at NCBI under accessions SRR22412642 (ref. 42), SRR22412643 (ref. 43), SRR22412644 (ref. 44), SRR22412645 (ref. 45), SRR22412646 (ref. 46), SRR22412647 (ref. 47), SRR22412648 (ref. 48) and SRR22412650 (ref. 49).

The final chromosome assembly was deposited in GenBank at NCBI under accession JAPQEV000000000 (ref. 50).

The genome annotation file is available in figshare51.

Technical Validation

The quality and quantity of total DNA were checked using agarose gel electrophoresis, and the concentration was determined using a NanoDrop 2000 spectrophotometer. RNA integrity was evaluated using an Agilent 2100 Bioanalyzer (Agilent Technologies, CA, USA). The samples used in our study had an RNA integrity number (RIN) greater than 8. To further assess the quality of the genome, clean PE100 reads were aligned back to the genome by BWA52, giving a mapping rate of 99.03%. The sequencing depth and GC content were also analyzed in 10 kb sliding windows. Moreover, 85.7% complete and 6.2% fragmented BUSCOs53 (Benchmarking Universal Single-Copy Orthologs, v4.0) from the arthropoda_odb9 database were identified, a noticeable improvement over the previous version (81.3%).
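The windowed GC analysis can be illustrated with a short function. This is a sketch under stated assumptions: the step size is not given in the text, so windows are non-overlapping by default (step equal to the window size); ambiguous bases such as N are excluded from the denominator; a trailing partial window is dropped; and the function name and toy sequence are hypothetical. Depth per window would additionally require the read alignments and is not shown.

```python
def gc_windows(sequence, window=10_000, step=None):
    """Return a list of (window_start, gc_fraction) along a sequence.

    gc_fraction is None for windows that contain no A, C, G or T.
    """
    if step is None:
        step = window  # non-overlapping windows (assumption)
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    seq = sequence.upper()
    results = []
    # Sequences shorter than one window are reported as a single window.
    last_start = max(len(seq) - window + 1, 1)
    for start in range(0, last_start, step):
        chunk = seq[start:start + window]
        acgt = sum(chunk.count(base) for base in "ACGT")
        gc = chunk.count("G") + chunk.count("C")
        results.append((start, gc / acgt if acgt else None))
    return results


# Toy sequence with a small window so the result is easy to follow.
toy_windows = gc_windows("GGCCAATT", window=4)
toy_sliding = gc_windows("GGCCAATT", window=4, step=2)
print(toy_windows, toy_sliding)
```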