Background & Summary

Yellow-cheek carp (Elopichthys bambusa), also known as the “water tiger”, is a species of the genus Elopichthys in the subfamily Leuciscinae of the family Cyprinidae. It is a large, ferocious carnivorous fish endemic to East Asia. In China, it is mainly distributed in river systems such as the Yangtze River, Pearl River and Yellow River1. Yellow-cheek carp lives in the upper layer of rivers and lakes, swims strongly and chases other fish for food. By preying on diseased and weak fish, it helps control their population size, which is of great significance for maintaining the ecological balance of the water environment2. Yellow-cheek carp is also an important economic fish, with firm, delicious meat that is rich in high-quality protein, unsaturated fatty acids, minerals and other nutrients3,4,5. However, anthropogenic factors such as overfishing, hydrological modification and water pollution have depleted the natural resources of yellow-cheek carp6,7, and the species has been listed in the “Key Protected Endangered and Threatened Aquatic Species” and the IUCN Red List of Threatened Species (Version 2020.3)8.

As a typical carnivore, yellow-cheek carp stands out among East Asian carp species, which are mainly omnivorous or herbivorous. For example, yellow-cheek carp and grass carp both belong to the subfamily Leuciscinae and are very closely related, yet they have evolved completely opposite feeding habits9, which provides excellent material for studying the evolution and genetic regulation of fish feeding habits. However, the lack of genomic information limits research on the mechanism by which carnivory arose in yellow-cheek carp. At the same time, high breeding profits have promoted the continuous development of the yellow-cheek carp aquaculture industry. Using live or frozen fish as the main feed not only raises breeding costs but also readily pollutes the aquaculture water, which greatly restricts the expansion of farming scale10. Therefore, research on the dietary transformation of typical carnivorous fishes such as yellow-cheek carp has gradually become a hot topic, and there is an urgent need for genetic breeding of yellow-cheek carp based on whole-genome information.

In this study, we combined PacBio long-read sequencing, Illumina short-read sequencing and Hi-C technology to generate a high-quality chromosome-level genome of the yellow-cheek carp (Fig. 1). We expect this resource to accelerate genetics research on yellow-cheek carp and the discovery of functional genes related to its key economic traits. Elucidating the structure and function of the genome will support more in-depth research into the genetic basis of important traits such as carnivory in yellow-cheek carp, thereby contributing to its resource protection, genetic selection and artificial breeding.

Fig. 1 Characterization of the assembled yellow-cheek carp genome. Circos plot of the yellow-cheek carp genome, showing gene density (1), TRP (2), LTR (3), SINE (4), LINE (5) and GC content (6) in order from outside to inside.


Sample collection and sequencing

An adult male yellow-cheek carp was collected from the Yangtze River in Wuhan, Hubei, China. High-quality genomic DNA was extracted from muscle by the CTAB method for Illumina sequencing, PacBio SMRT sequencing11 and Hi-C. The quality of the extracted DNA was assessed using agarose gel electrophoresis and a NanoDrop Spectrophotometer (Thermo Fisher Scientific, USA), and the DNA was quantified with a Qubit Fluorometer (Invitrogen, USA).

For Illumina sequencing, the genomic DNA was randomly sheared into 300–500 bp fragments, and a paired-end genomic library was prepared following the manufacturer’s protocol. The library was then sequenced on an Illumina NovaSeq platform with a paired-end 150 bp layout for genome survey and base-level correction. For PacBio long-read sequencing, SMRTbell libraries were constructed from the genomic DNA and sequenced on the PacBio Sequel II platform. In total, approximately 58.98 Gb of Illumina short-read data (71.31× coverage) and 27.35 Gb of PacBio continuous long-read (CLR) data (32.65× coverage) were obtained.

To generate a chromosome-level assembly of the yellow-cheek carp genome, a Hi-C library was prepared using DNA extracted from the same individual. After cell crosslinking, cell lysis, chromatin digestion, biotin labelling, proximity ligation of chromatin DNA and DNA purification, the resulting Hi-C library was subjected to paired-end sequencing with 150 bp read lengths on an Illumina NovaSeq platform. In total, 151.98 Gb of Hi-C data were obtained, covering the genome at 183.78×.
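The coverage values quoted for the Illumina, PacBio and Hi-C data follow from dividing sequencing yield by genome size. The sketch below illustrates that arithmetic. The genome size used as the denominator is not stated here, so the 827.63 Mb assembly length reported in the Genome assembly section is taken as an assumption, and the function and variable names are illustrative only.

```python
def sequencing_depth(yield_gb: float, genome_size_mb: float) -> float:
    """Mean sequencing depth (x) = total sequenced bases / genome size."""
    if genome_size_mb <= 0:
        raise ValueError("genome size must be positive")
    # Convert Gb to Mb so that both quantities share the same unit.
    return yield_gb * 1000.0 / genome_size_mb


# Assumption: the final assembly length (827.63 Mb) is the denominator.
ASSUMED_GENOME_MB = 827.63

# Yields reported in the text, in Gb.
yields_gb = {"Illumina": 58.98, "PacBio CLR": 27.35, "Hi-C": 151.98}

depths = {
    name: sequencing_depth(gb, ASSUMED_GENOME_MB)
    for name, gb in yields_gb.items()
}

for name, depth in depths.items():
    print(f"{name}: {depth:.2f}x")
```

With this denominator the computed depths are close to, but not identical with, the reported coverages, which suggests that a slightly different genome size was used in the original calculation.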

To aid genome annotation, total RNA was extracted from muscle, spleen, gonad and skin and tested for purity and integrity using a NanoDrop Spectrophotometer (Thermo Fisher Scientific, USA) and an Agilent 2100 Bioanalyzer (Agilent Technologies, USA). The RNA library was constructed using the NEBNext® Ultra™ RNA Library Prep Kit (Illumina, USA) following the manufacturer’s protocol and sequenced on an Illumina NovaSeq 6000 platform. In total, 23.74 Gb of data were obtained (Table 1).

Table 1 Statistics of the sequencing data used for genome assembly.

Genome assembly

First, SOAPnuke (v2.1.0)12 was used for quality control of the Illumina data, and the clean data were used for genome size estimation. K-mer analysis13 was conducted using GCE (v1.0.2). The genome size was estimated to be 786.16 Mb, with a heterozygosity ratio of 0.47% and a repeat sequence ratio of 47.03% (Table 2). A total of 27.35 Gb of PacBio long-read data were used for de novo genome assembly with MECAT2 (v2.0.0)14 and NextDenovo (v2.4.0). The assembly was then polished with gcpp (v2.0.2) and Pilon (v1.22)15. The resulting assembly consists of 170 contigs with a total length of 827.63 Mb (Table 3).
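The idea behind k-mer based genome size estimation can be sketched with the textbook relation: genome size ≈ total number of k-mers / depth of the main k-mer peak. The example below is a minimal illustration of that relation on a toy histogram. It is not GCE's actual model, which also accounts for heterozygosity and repeats, and the function name, the `min_depth` error cutoff and the toy data are all assumptions made for illustration.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size from a k-mer depth histogram.

    histogram maps k-mer depth -> number of distinct k-mers at that depth.
    K-mers below min_depth are treated as sequencing errors and ignored
    (a simplifying assumption; real tools model this more carefully).
    Returns (estimated_size, peak_depth).
    """
    trusted = {d: n for d, n in histogram.items() if d >= min_depth}
    if not trusted:
        raise ValueError("no k-mers at or above min_depth")
    # The main peak is the depth shared by the most distinct k-mers.
    peak_depth = max(trusted, key=trusted.get)
    # Total k-mer occurrences = sum over depths of depth * distinct count.
    total_kmers = sum(d * n for d, n in trusted.items())
    return total_kmers / peak_depth, peak_depth


# Toy histogram: low-depth error k-mers, a main peak at depth 20,
# and a small repeat shoulder at depth 40.
toy_histogram = {1: 1000, 2: 300, 18: 50, 19: 80, 20: 100, 21: 80, 22: 50, 40: 10}
size, peak = estimate_genome_size(toy_histogram)
print(f"peak depth: {peak}, estimated size: {size:.0f} k-mer positions")
```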

Table 2 K-mer frequency and genome size evaluation of yellow-cheek carp genome.
Table 3 Statistics for Hi-C assisted assembly.

Hi-C scaffolding

Hi-C technology was used for chromosome-level genome assembly. Trimmomatic16 with the parameters LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50 was used to remove adapters and low-quality fragments from the raw Hi-C reads. The processed reads were then aligned to the assembly using Juicer (v1.6)17 with default settings. Contigs were scaffolded using the 3D-DNA pipeline18 with all valid Hi-C reads. Juicebox (v2.13.07)17 was used to adjust the chromosome-scale scaffolds manually (Fig. 2, Table 4). The final 24 chromosomes contain 141 gaps.
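To make the Trimmomatic parameters concrete, the sketch below mimics what each step does to the Phred quality scores of a single read: LEADING and TRAILING strip bases below quality 3 from either end, SLIDINGWINDOW cuts the read where a 4-base window first averages below 15, and MINLEN discards reads shorter than 50 bases. This is a simplified reading of the documented behaviour, not Trimmomatic's own code, and it may differ from the real tool in edge cases. The function name and the example read are hypothetical.

```python
def trim_read(quals, leading=3, trailing=3, window=4, window_q=15, min_len=50):
    """Return (start, end) indices of the retained bases, or None if dropped.

    quals is a list of Phred quality scores for one read. Steps are applied
    in the order in which they appear on the command line in the text.
    """
    start, end = 0, len(quals)

    # LEADING: drop bases below the threshold from the 5' end.
    while start < end and quals[start] < leading:
        start += 1

    # TRAILING: drop bases below the threshold from the 3' end.
    while end > start and quals[end - 1] < trailing:
        end -= 1

    # SLIDINGWINDOW: cut at the first window whose mean quality is too low.
    for i in range(start, end - window + 1):
        if sum(quals[i:i + window]) < window_q * window:
            end = i
            break

    # MINLEN: discard reads that end up too short.
    if end - start < min_len:
        return None
    return start, end


# Hypothetical read: 2 poor bases, 60 good bases, 8 mediocre, 3 poor.
example = [2] * 2 + [30] * 60 + [10] * 8 + [2] * 3
print(trim_read(example))
```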

Fig. 2 Genome-wide Hi-C interaction mapping of chromosome sections.

Table 4 Statistics of the assembled chromosomes and their corresponding chromosomes in the reference genome.

Repeat annotation

We used de novo prediction and homology comparison to annotate the repetitive sequences in the genome. RepeatModeler19 was used to detect and classify the repetitive sequences in the genome assembly, using tools including RECON (v1.08)20, RepeatScout (v1.0.5)21, LTR-FINDER (v1.0.5)22 and TRF (v4.0.935)23. For homology comparison, RepeatMasker (open-4.0.9) and RepeatProteinMask (open-4.0.9) were used to identify known TEs in the yellow-cheek carp genome against the Repbase TE library24,25 and the TE protein database, respectively. In total, repetitive sequences amounted to 456.66 Mb, accounting for 55.17% of the assembled genome. Among the repeat elements, short interspersed nuclear elements (SINEs) accounted for 0.24% of the genome and long interspersed nuclear elements (LINEs) for 7.67%. Long terminal repeats (LTRs) and DNA elements accounted for 12.31% and 34.87%, respectively (Table 5).

Table 5 Repetitive elements and their proportions in yellow-cheek carp genome.

Protein-coding gene prediction and annotation

In this study, ab initio gene prediction, homology-based gene prediction and transcript-based prediction were used to predict the protein-coding genes of the yellow-cheek carp genome. Prior to gene prediction, the assembled genome was hard- and soft-masked using RepeatMasker. Ab initio gene prediction was performed using Augustus (v3.3.1)26,27 and Genscan (v1.0)28. The models used for each gene predictor were trained on a set of high-quality proteins generated from the RNA-Seq data. For homology-based prediction, GlimmerHMM (v3.0.4)29 was used with default parameters to align the protein sequences to our genome assembly and predict coding genes. The reference protein sequences of five fish species, Ctenopharyngodon idella, Sinocyclocheilus grahami, Megalobrama amblycephala, Danio rerio and Cyprinus carpio, were obtained from the NCBI database. For transcript-based prediction, clean RNA-Seq reads were assembled into transcripts against the yellow-cheek carp genome using StringTie (v2.1.1)30, and gene structures were then determined using PASA (v2.4.1)31. To consolidate the results of these three methods, MAKER (v3.00)32 was used to merge and integrate the gene predictions.

For functional annotation of the predicted genes, BLASTP (v2.6.0)33,34 was used to align them against the Kyoto Encyclopedia of Genes and Genomes (KEGG)35, Gene Ontology (GO)36, NCBI-NR (non-redundant protein database), Swiss-Prot37, TrEMBL38 and InterPro39 databases. In total, we predicted 24,153 protein-coding genes in the genome. These genes had an average coding sequence length of 1,638.21 bp, an average gene length of 18,969.98 bp and an average of 9.87 exons per gene (Table 6). Further, 22,965 genes, accounting for 95.54% of the predicted genes, were assigned at least one functional annotation (Table 7).

Table 6 Basic statistical results of gene prediction.
Table 7 Functional annotation statistics.

Annotation of non-coding RNA genes

tRNAscan-SE (v1.3.1)40 with default parameters was used to identify tRNA genes. rRNA sequences of closely related species were downloaded from the Ensembl database and aligned against our genome using BLASTn (v2.6.0)41 with E-value <1e-5, identity ≥85% and match length ≥50 bp. miRNAs and snRNAs were identified using Infernal (v1.1.2)42 against the Rfam (v14.1) database with default parameters. As a result, we annotated 76 rRNAs, 2,469 tRNAs, 291 miRNAs and 212 snRNAs (Table 8).
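The rRNA hit thresholds above (E-value <1e-5, identity ≥85%, match length ≥50 bp) can be applied as a simple filter over BLASTn output. The sketch below assumes the alignments were written in BLAST's standard 12-column tabular format (`-outfmt 6`), which is not stated in the text. The function name and the example hits are hypothetical.

```python
def filter_rrna_hits(lines, max_evalue=1e-5, min_identity=85.0, min_length=50):
    """Keep BLAST tabular hits that pass the three thresholds.

    Assumes default -outfmt 6 columns: qseqid, sseqid, pident, length,
    mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore.
    """
    kept = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        identity = float(fields[2])   # percent identity
        length = int(fields[3])       # alignment length in bp
        evalue = float(fields[10])    # expectation value
        if evalue < max_evalue and identity >= min_identity and length >= min_length:
            kept.append(fields)
    return kept


# Hypothetical hits: only the first passes all three thresholds.
example_hits = [
    "rRNA_5S\tchr1\t97.50\t120\t3\t0\t1\t120\t1000\t1119\t1e-50\t200",
    "rRNA_5S\tchr2\t80.00\t120\t24\t0\t1\t120\t500\t619\t1e-20\t90",   # identity too low
    "rRNA_18S\tchr3\t99.00\t40\t0\t0\t1\t40\t10\t49\t1e-10\t70",       # match too short
    "rRNA_18S\tchr4\t90.00\t60\t6\t0\t1\t60\t10\t69\t1e-3\t40",        # E-value too high
]
passed = filter_rrna_hits(example_hits)
for hit in passed:
    print(hit[0], hit[1], hit[2], hit[3], hit[10])
```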

Table 8 Statistics of non-coding RNA annotation.

Data Records

All the raw sequencing data have been deposited in the NCBI database under the accession number SRP470306 (ref. 43). The genome assembly has been deposited at GenBank under the accession GCA_037101425.1 (ref. 44). Genome annotations, along with predicted coding sequences and protein sequences, can be accessed through Figshare45.

Technical Validation

We assessed the completeness of the genome assembly using BUSCO (v3.0.259)46 with the reference Actinopterygii gene set (n = 3,640). The final genome assembly showed a BUSCO completeness of 98.4%, consisting of 3,538 (97.2%) single-copy BUSCOs, 45 (1.2%) duplicated BUSCOs, 26 (0.7%) fragmented BUSCOs and 31 (0.9%) missing BUSCOs (Table 9). Compared with the BUSCO results for Squaliobarbus curriculus (95.8%) and Mylopharyngodon piceus (96.0%), this indicates the high quality of the yellow-cheek carp genome assembly47.
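The BUSCO percentages follow directly from the four category counts, with completeness defined as single-copy plus duplicated BUSCOs over the total. A minimal sketch of that arithmetic, using the counts reported above (the function name is illustrative only):

```python
def busco_summary(single, duplicated, fragmented, missing):
    """Return BUSCO category percentages, rounded to one decimal place."""
    total = single + duplicated + fragmented + missing
    if total == 0:
        raise ValueError("no BUSCO groups supplied")

    def pct(count):
        return round(100.0 * count / total, 1)

    return {
        "total": total,
        "complete": pct(single + duplicated),  # complete = single + duplicated
        "single": pct(single),
        "duplicated": pct(duplicated),
        "fragmented": pct(fragmented),
        "missing": pct(missing),
    }


# Counts reported in the text.
summary = busco_summary(single=3538, duplicated=45, fragmented=26, missing=31)
for key, value in summary.items():
    print(f"{key}: {value}")
```

The four counts sum to the stated gene set size of 3,640, and the rounded percentages reproduce the values given in the text.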

Table 9 BUSCO evaluation results for the genome assembly.