Background & Summary

Collichthys lucidus (C. lucidus, FishBase ID: 23635, NCBI Taxonomy ID: 240159, Fig. 1), also called spiny head croaker or big head croaker, belongs to Perciformes, Sciaenidae, Collichthys and is mainly distributed in the shore waters of the northwestern Pacific, covering from the South China Sea to Sea of Japan1. C. lucidus is a commercially important marine fish species with high market value and has been widely consumed in coastal regions in China2.

Fig. 1
figure 1

A picture of Collichthys lucidus used for the genome sequencing.

At present, the research on C. lucidus mostly focused on phylogeny and population genetics3,4,5,6,7. C. lucidus exhibits apparent sex dimorphism on the growth rate that the female grow much faster than male ones; therefore, the understanding of its sex-determination would facilitate the development of the sex control technique in aquaculture industry to increase the annual yield. More interesting, our previous cytogenetic study showed that female C. lucidus had 24 pairs of acrocentric chromosomes (2n = 48a, NF = 48), while male ones had 22 pairs of acrocentric chromosomes, two monosomic acrocentric chromosomes and one metacentric chromosome (2n = 1 m + 46a, NF = 48)8. There is an X1X1X2X2/X1X2Y mechanism of the sex-chromosome type in C. lucidus, while Y is a unique metacentric chromosome in the male karyotype. Although multiple sex chromosome systems are found in several Perciformes species9, C. lucidus is the first reported case in the Sciaenidae species. At present, researches on the sex determination and differentiation mechanism in the Sciaenidae species are still lacking. Previous studies showed that no heterotropic chromosome was found in large yellow croaker (Larimichthys crocea) and spotted maigre (Nibea albiflora)10,11. As a close-related species in the same family, the chromosome comparison might provide insights into chromosome evolution among the species and the relationship to the evolution of sex-determination in Sciaenidae.

To obtain high-quality chromosome sequences of C. lucidus, we applied a combined strategy of Illumina, PacBio and Hi-C technology12 to sequence the genome of C. lucidus and reported the first chromosome-level assembly of this important species. The genome will be used for the functional gene mapping of the economic traits and the sex-determination of C. lucidus, as well as in the chromosome evolution investigations among Sciaenidae and teleosts.

Methods

Sample collection

A female wild-caught adult C. lucidus in Baima Harbor, Ningde, Fujian, China (26.7328°N, 119.7329°E) was used for the genome sequencing and assembly. The reason we chose a female sample is that the heterotropic chromosome in male might increase the technical challenge of genome assembly, especially for X1 and X2 chromosomes. Muscle, eye, brain, heart, liver, spleen, kidney, head kidney, gonad, stomach and intestines of the fish were harvested. All samples were rinsed with 1×PBS (Phosphate Buffered Solution) solution quickly, frozen with liquid nitrogen over 24 hours and then stored in −80 °C before sample preparation.

DNA extraction and sequencing

Phenol/chloroform extraction method was used in DNA molecules extraction from muscle tissues. The DNA molecules were used for sequencing on the Illumina (Illumina Inc., San Diego, CA, USA) and PacBio sequencing platform (Pacific Biosciences of California, Menlo Park, CA, USA). DNA library construction and sequencing in the Illumina sequencing platform were carried out according to the manufacturer’s instruction as in the previous study13. Briefly, the DNA extracted from muscle samples were randomly sheared to 300–350 bp fragments using an ultrasonic processor and paired-end library was constructed through the steps of end repair, poly(A) addition, barcode index, purification, and PCR amplification. The constructed DNA library was sequenced by Illumina HiSeq X platform in 150 PE mode. As a result of Illumina sequencing, we obtained 52.0 Gb raw genome data for C. lucidus. After the quality filtering, 51.35 Gb clean reads were retained as summarized in Table 1. Meanwhile, Genomic DNA molecules of C. lucidus were also used for one 20 kb library construction. Eleven flow cells were used in the PacBio Sequel platform to generate 90.7 Gb (109.3× coverage) polymerase sequencing data. After filtering adaptors in the sequencing reads, 90.5 Gb long reads were obtained for the following genome assembly (Table 1).

Table 1 Sequencing data used for the C. lucidus genome assembly.

RNA extraction and sequencing

Transcriptome of C. lucidus was also sequenced in this work for the gene prediction after the genome assembly. Muscle, eye, brain, heart, liver, spleen, kidney, head kidney, gonad, stomach and intestines tissues collected before from the same individual were used for RNA extraction with TRIZOL Reagent (Invitrogen, USA). The RNA molecules extracted from tissues were then equally mixed for RNA sequencing. According to the protocol suggested by the manufacturer, RNA sequencing library was constructed as the previous study14 and sequenced by Illumina HiSeq X Ten in 150PE mode (Illumina Inc., San Diego, CA, USA). Finally, ~9.8 Gb RNA-seq data were obtained (Table 1).

Genome survey and contig assembly

The genome size of the genome of C. lucidus was estimated with Illumina sequencing data using Kmer-based method implemented in GCE (v1.0.0)15 before genome assembly. Using Kmer size of 17, we obtain a Kmer frequency distribution for C. lucidus (Fig. 2). The genome size was estimated using the following equation: G = (L − K + 1) × nbase/(CKmer × L), Where G is the estimated genome size, nbase is the total count of bases, CKmer is the expectation of Kmer depth, L and K is the read length and Kmer size. Since Kmers with the depth smaller than three were likely from sequencing errors, we, therefore, revise the genome size by the following method: Grevise = G × (1 – Error Rate). As a result, we estimated female C. lucidus genome size of 830 Mb with the heterozygosity of 0.81% and the whole-genome average GC content of 42%.

Fig. 2
figure 2

Kmer frequency of C. lucidus. Note that the first, second and third peak was composed of the homozygous, heterozygous and repeated Kmers, respectively.

To assembly contig sequences using long-read data, the software Falcon v0.3016 was used for the contig assembling of the female genome of C. lucidus with default parameters. The genome assembly was performed by following steps in Falcon: First, daligner17 was used to generate read alignments, and the consensus reads were generated. Then, the overlap information among error-corrected reads were generated by daligner. Finally, a directed string graph was constructed from overlap data, and contig path were resolved by the string graph. Two round of sequence polishing was performed as follows: the assembled genome sequence was first polished with arrow18 using PacBio long reads, and Pilon19 was then used with Illumina sequencing data. In the end, we yielded a final genome contig assembly of C. lucidus with a total length 877.4 Mb with 2,912 contigs and a contig N50 of 1.10 Mb. (Table 2).

Table 2 Assembly statistics of C. lucidus.

Chromosome assembly using Hi-C data

To obtain a chromosome assembly of C. lucidus, we applied the Hi-C technique to generate the interaction information among contigs. 1 g muscle tissue was used for Hi-C library construction. The processes of crosslinking, lysis, chromatin digestion, biotin marking, proximity ligations, crosslinking reversal, and DNA purification steps were used in previous studies20. The Hi-C library was sequenced in Illumina HiSeq X Ten platform, and 193.1 Gb Hi-C reads were generated (Table 1). The reads were aligned to the assembled contig sequences using Bowtie software, and the alignment was filtered as our previous study21. The interaction matrix among contig was generated, and Lachesis22 was then applied to anchor contigs into chromosomes with the agglomerative hierarchical clustering method. Finally, we successfully scaffolded 2,134 contigs into 24 chromosomes, representing 96.86% of the total assembled genome. The contig and scaffold N50 of the chromosome assembly was 1.1and 35.9 Mb, respectively. We noted that there are 865 contigs cannot reliably be anchored to any chromosome, and the N50 length of unanchored contigs was 49.4 kb, which was significantly smaller than that of 1.16 Mb for anchored contigs.

Gene prediction and functional annotation

The repetitive sequences in the C. lucidus genome sequences were annotated through a combination of homology prediction and ab initio prediction. RepeatMasker (http://www.repeatmasker.org/)23 and RepeatProteinMask were applied for searching against RepBase database (http://www.girinst.org/repbase). We used Tandem Repeats Finder (TRF)24 and LTR-FINDER25 with default parameters for ab initio prediction. As a result, we identified 304.40 Mb of the assembled C. lucidus genome as repetitive elements, accounting for 34.68% of the total genome sequences. The repetitive elements were masked in the C. lucidus genome sequences, and the repeat-masked genome was used for the gene prediction.

The protein-coding gene annotation was identified by a combined strategy of homology-based prediction, ab initio prediction, and transcriptome-based prediction method. The protein sequences of several teleosts, including Danio rerio (GCF_000002035.6), Dicentrarchus labrax (GCA_000689215.1), Gasterosteus aculeatus (GCA_000180675.1), Oryzias latipes (GCF_002234675.1) and Takifugu rubripes (GCF_000180615.1) were mapped upon the assembled C. lucidus genome using TBLASTN26. The alignments were conjoined by Solar software27. GeneWise28 was used to predict the exact gene structure of the corresponding genomic region on each BLAST hit. Furthermore, the sequences from RNA-seq were aligned to the assembled C. lucidus genome to identify potential exon regions by TopHat29 and Cufflinks30. Then, Augustus31 was also used to predict coding regions in the repeat-masked genome sequences. All these results were merged by MAKER32, leading to a total 28,602 protein-coding genes (Table 3). After homolog searching against to NCBI non-redundant protein (NR)33, TrEMBL34, Gene Ontology (GO)35, SwissProt34, Kyoto Encyclopedia of Genes and Genomes (KEGG)36, InterPro37, 28,032 (98.01%) protein-coding genes were annotated with at least one public functional database (Table 4).

Table 3 General statistics of predicted protein-coding genes.
Table 4 General statistics of gene function annotation.

Repeat distribution and potential sex-determination gene identification

The distribution of repetitive elements along chromosomes was plot in Fig. 3. The repeats were generally concentrated at the two ends of the chromosomes, especially on the beginning end of the chromosome 1 in the assembled C. lucidus genome. Our previous cytogenetic analysis revealed that a chromosome with ending massive repeats was involved in the formation of Y specific metacentric chromosome8, we therefore speculated that chromosome 1 might be one of the two chromosomes in the sex chromosome fusion. Twenty one potential key genes in sex development of teleost were identified along the assembled C. lucidus genome (Fig. 3), facilitating the gene expression and functional studies aiming to the deciphering the sex-determination of C. lucidus. We identified the only one copy of Dmrt1 gene (dsx- and mab-3 related transcription factor 1) in the chromosome 11. Our previous studies on the studies of L. crocea10 and N. albiflora11 revealed that Dmrt1 was a key gene in sex-determination of two species, we therefore speculated the Dmrt1 gene might also play an central role in sex-determination process of C. lucidus. The sequences of chromosomes and genes provided valuable resource for the following sex-determination investigations.

Fig. 3
figure 3

Repetitive element distribution and potential sex-determination gene identification in the chromosomes of C. lucidus. The color bar represented the density of repetitive elements (number per 100 kb) along the genome and 21 key genes involving in teleost sex-determination that reported in previous studies were identified and label on chromosomes.

Data Records

The genomic Illumina sequencing data were deposited in the Sequence Read Archive at NCBI SRR820833238.

The genomic PacBio sequencing data were deposited in the Sequence Read Archive at NCBI SRR814290139.

The transcriptome Illumina sequencing data were deposited in the Sequence Read Archive at NCBI SRR820833140.

The Hi-C sequencing data were were deposited in the Sequence Read Archive at NCBI SRR820830141.

The final chromosome assembly were deposited in the GenBank at NCBI SCMI0000000042.

The genome annotation file is available within figshare43.

The sequences of potential sex-determination genes identified from the assembled C. lucidus genome is available within figshare44.

Technical Validation

The quality of the DNA molecules was checked by agarose gel electrophoresis, showing the main band around 20 kb, and the extracted DNA spectrophotometer ratios (SP) were 260/280 ≥ 1.8.

The quality of the purified RNA molecules were checked by Nanodrop ND-1000 spectrophotometer (LabTech, USA) as the absorbance >1.7 at 260 nm/280 nm and 2100 Bioanalyzer (Agilent Technologies, USA) as the RIN of 8.0.

The raw reads from Illumina sequencing platform were cleaned using FastQC45 and HTQC46 by the following steps: (a) filtered reads with adapter sequence; (b) filter PE reads with one reads more than 10% N bases; (c) filtered PE reads with any end has more than 50% inferior quality (< = 5) bases.

The quality of the assembled genome were validated on terms of the completeness, accuracy and conservation synteny. Firstly, the completeness of the genome sequences was validated by the alignments of PacBio long reads.Minimap247 with default parameters was applied to map the CLR (Continuous Long Reads) subreads of C. lucidus back to the final chromosome assembly. We found that about 96.2% of the long reads could be aligned to the assembled genome, and the average depth of the alignment along the genome was 103 × . More than 99.78% and 98.1% of the genome sequences were aligned by at least 1× and 20× coverage, respectively. Secondly, we further confirmed the completeness of the assembled genome using BUSCO v3.048. As a result, 97.6% and 97.4% BUSCO genes were completely or partially identified in the assembled C. lucidus genome with the vertebrate and actinopterygii database, respectively. Thirdly, the accuracy of the genome assembly was evaluated by variants calling using Illumina data. The short reads were mapped to the genome sequences with BWA49. The insertion length distribution with one peak agreed well with our experimental design, suggesting the accuracy of the genome assembly. SNP calling with read alignments in GATK50 resulted in 2,593,807 heterozygous and 11,282 homozygous SNP loci along the genome sequences, suggesting the base-level accuracy of 99.999% for the genome assembly. Fourthly, the conservation synteny between C. lucidus and L. crocea51 were compared to validate the chromosome assembly. We observed a highly conserved synteny and strict correspondence of chromosome assignment (Fig. 4).

Fig. 4
figure 4

Chromosome comparison of C. lucidus to L. corcea using protein-coding genes synteny. The chromosome id of C. lucidus were sorted by the sequence lengths.