Background & Summary

Zacco platypus is a colorful minnow endemic to East Asia and widespread in the region's freshwater ecosystems1. It is often used as a test model and indicator species to assess contamination of aquatic environments in North Korea, South Korea and China2,3,4,5. Recently, owing to its distinctive nuptial characteristics (sexual dimorphism and dichromatism, and the elongated anal fin and nuptial tubercles of the male), Z. platypus has become an important emerging native ornamental fish in China.

Z. platypus has a long and complex taxonomic history. As the type species of the genus Zacco, Z. platypus was first described from Nagasaki, Japan6. It was successively placed in Cyprinidae, Leuciscinae, Zacco7 and Cyprinidae, Danioninae, Zacco8. After a series of revisions9,10,11,12, it is currently assigned to Xenocyprididae, Opsariichthyinae, Zacco13. The genus Zacco was established in 1902, and the criterion discriminating Zacco from Opsariichthys was that peculiar notched jaws are present in Opsariichthys but absent in Zacco14. After a series of taxonomic studies15,16,17,18,19, the diagnostic features of Zacco were modified to: (1) “the nuptial tubercles on the cheeks are united basally to form a plate in male”, as noted by Jordan and Hubbs20; and (2) light green lateral crossbars fused into fewer, larger patches, which clearly distinguish it from members of Opsariichthys21. A series of population genetic studies using mitochondrial cytochrome b (Cytb) fragments and intron polymorphisms revealed that Chinese Z. platypus contains multiple molecular lineages22,23,24. Morphological comparisons and genetic analyses using AFLP markers indicated that O. evolans, which had been proposed as a synonym of Z. platypus, is a valid species25. In particular, both O. evolans and O. acutipinnis had been reported as synonyms of Z. platypus8. Therefore, the molecular lineages from the upper-middle Yangtze and Pearl River basins should be regarded as members of the O. acutipinnis-O. evolans complex. Consequently, it was once thought that the genus Zacco should be restricted to its type species, Z. platypus, which may be distributed only from Japan to the north of Zhejiang Province, China26.

In the last few years, new opinions on the taxonomy of Zacco have emerged. Molecular analysis based on three nuclear genes suggested that Z. acanthogenys might be a valid species, although no comprehensive diagnostic feature other than the red upper iris has been reported27. More recently, Z. sinensis sp. nov. and Z. tiaoxiensis sp. nov. were described using morphological and mitochondrial data28,29. Owing to its wide distribution, Z. platypus exhibits considerable morphological variation. Our field surveys found that body size and color pattern vary among river basins in China, and even among drainages of the same river basin. A limited set of nuclear genes or mitochondrial markers represents only a small fraction of the genome, and mitochondrial markers are maternally inherited, which may bias systematic and taxonomic conclusions30. Thus, the taxonomy of Zacco is still under debate. To facilitate taxonomic and phylogenetic studies of Zacco fishes, genome-wide genetic information is urgently needed. Although Xu et al. have reported a whole-genome assembly of Z. platypus31, a chromosome-level genome assembly of this species is still unavailable. Here, we assembled a high-quality chromosome-level genome of the East Asian endemic minnow Z. platypus. This new assembly will greatly improve systematic and taxonomic studies of the genus Zacco. Furthermore, access to the genomic data set will facilitate the use of Z. platypus as an indicator organism for assessing contamination of aquatic environments.

Methods

Sample collection and genome sequencing

A healthy female Z. platypus was collected from Xingtai City, Hebei Province, China (37.0750°N, 113.9221°E). High-quality genomic DNA was extracted from muscle tissue, and library construction and sequencing were carried out at Frasergen Co., Ltd. (Wuhan, China). For short-read sequencing, paired-end libraries with an insert size of 300–350 base pairs (bp) were sequenced on the Illumina HiSeq X Ten platform (Illumina, San Diego, CA, USA). For long-read sequencing, PacBio sequencing was performed on a PacBio Sequel II platform in continuous long-read (CLR) mode.

To anchor scaffolds onto chromosomes, a chromosome conformation capture (Hi-C) library was prepared from muscle tissue. The Hi-C library was constructed following the standard protocol described previously32 and sequenced on an Illumina HiSeq X Ten platform (Illumina, San Diego, CA, USA).

In addition, total RNA was extracted from muscle, blood, brain, liver, and spleen tissues for Iso-Seq using the Qiagen RNeasy Mini Kit (Qiagen, Hilden, Germany). RNA samples from the five tissues were mixed in equal amounts. An Iso-Seq cDNA library was constructed according to the PacBio standard protocol with the BluePippin size selection system (Sage Science, MA, USA) and sequenced on the PacBio Sequel II platform.

Karyotypic analysis

An adult female Z. platypus collected from the same location as the sequenced individual was used for karyotyping, following a published pipeline33. Chromosomes were photographed using a Leica DM4 B fluorescence microscope (Leica, Wetzlar, Germany). Chromosomes were classified according to the standardized nomenclature34. The results showed that Z. platypus has a chromosome number of 2n = 48 and a karyotype formula of 18M + 24SM/ST + 6T (Fig. 1A).
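As an illustration of how a karyotype formula of this kind is derived, the sketch below classifies chromosomes by arm ratio. It assumes the widely used arm-ratio thresholds (1.7, 3.0 and 7.0), which may or may not match the nomenclature cited above; the function name and the example arm lengths are hypothetical.

```python
def classify_chromosome(long_arm: float, short_arm: float) -> str:
    """Classify a chromosome by arm ratio (long arm / short arm).

    Thresholds are an assumption (a commonly used arm-ratio scheme):
    M: ratio <= 1.7, SM: <= 3.0, ST: <= 7.0, T: > 7.0 or no visible short arm.
    """
    if long_arm <= 0 or short_arm < 0 or short_arm > long_arm:
        raise ValueError("expected long_arm > 0 and 0 <= short_arm <= long_arm")
    if short_arm == 0:
        return "T"
    ratio = long_arm / short_arm
    if ratio <= 1.7:
        return "M"
    if ratio <= 3.0:
        return "SM"
    if ratio <= 7.0:
        return "ST"
    return "T"


# Consistency check on the reported formula: the class counts must sum to 2n.
karyotype_formula = {"M": 18, "SM/ST": 24, "T": 6}
diploid_number = sum(karyotype_formula.values())
```

Here `diploid_number` evaluates to 48, matching the reported 2n = 48 and the 24 chromosome pairs expected in the assembly.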

Fig. 1 Karyotype and genomic information visualization of Zacco platypus. (A) The image and karyotype of Z. platypus. (B) Heat map of interaction intensity between chromosome sequences of Z. platypus anchored by Hi-C. (C) Circos plot of the 24 assembled chromosomes of the Z. platypus genome. From the outside to the inside, the tracks indicate the 24 chromosomes, GC content (bin = 1 Mb), gene density (bin = 1 Mb), repetitive sequence density (bin = 1 Mb) and the major interchromosomal syntenic relationships, respectively.

Genome assembly

The Illumina sequencing produced 84.32 Gb of clean data after quality control (Table 1). The genome size, repeat content and heterozygosity were estimated by K-mer analysis of the Illumina short reads. Frequencies of K-mers (K = 17) were counted using Jellyfish v2.2.6 (ref. 35). The genome size was estimated to be approximately 818.15 Mb, with a heterozygosity of 0.37% and a repeat content of 47.72%. The genome was then assembled from the 172.05 Gb of PacBio data using the Falcon assembler v0.3 (Table 1). The draft genome was further polished with gcpp v2.0.2 (https://github.com/PacificBiosciences/gcpp) and Pilon v1.22 (ref. 36) to improve assembly quality. This preliminary assembly of the Z. platypus genome was 814.87 Mb in length with an N50 of 8.10 Mb (Table 2).
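The idea behind K-mer-based genome size estimation can be sketched as follows: the total number of K-mers divided by the depth of the main peak of the K-mer frequency histogram approximates the genome size. This is a minimal sketch, not the estimation procedure used in this study; the error cutoff `min_depth` and the toy histogram are assumptions made for illustration. The second function simply divides sequenced bases by genome size to give the raw sequencing depth.

```python
def estimate_genome_size(histogram: dict[int, int], min_depth: int = 5) -> float:
    """Estimate genome size as (total k-mers) / (depth of the main peak).

    histogram maps k-mer depth -> number of distinct k-mers at that depth.
    Depths below min_depth are treated as sequencing errors and ignored
    (the cutoff is an assumption, not a value from the paper).
    """
    kept = {depth: count for depth, count in histogram.items() if depth >= min_depth}
    if not kept:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(kept, key=kept.get)  # depth with the most distinct k-mers
    total_kmers = sum(depth * count for depth, count in kept.items())
    return total_kmers / peak_depth


def mean_depth(total_bases: float, genome_size: float) -> float:
    """Raw sequencing depth: sequenced bases divided by genome size."""
    return total_bases / genome_size


# Synthetic histogram for illustration only (not data from this study).
toy_histogram = {1: 1000, 2: 100, 18: 50, 19: 80, 20: 100, 21: 80, 22: 50, 40: 10}
toy_size = estimate_genome_size(toy_histogram)

# Depth implied by the figures reported above: 84.32 Gb over ~818.15 Mb.
illumina_depth = mean_depth(84.32e9, 818.15e6)
```

With the reported figures, the Illumina data correspond to roughly 103-fold raw depth of the estimated genome size.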

Table 1 Statistics of sequencing data.
Table 2 Statistics of Zacco platypus genome assembly.

Subsequently, 151.17 Gb of Hi-C data were aligned to the assembly using Juicer v1.6.2 (ref. 37) (Table 1). The contigs were ordered and anchored with the Hi-C data using 3D-DNA38 and manually adjusted using Juicebox Assembly Tools v1.11.08 (ref. 39). The resulting Hi-C interaction heatmap indicated a high-quality genome assembly (Fig. 1B). Approximately 98.64% of the contig sequences were anchored to 24 chromosomes, consistent with the karyotype analysis in this study (Fig. 1A). Circos40 was used to visualize the 24 chromosomes, GC content, gene density, repetitive sequence density and major interchromosomal syntenic relationships (Fig. 1C). The longest and shortest chromosomes were 46.87 Mb and 25.28 Mb in length, respectively (Table 2). The N50 of the final genome assembly reached 32.32 Mb (Table 2). Assembly completeness was evaluated with Benchmarking Universal Single-Copy Orthologs (BUSCO) v3.0.2 (ref. 41) using the actinopterygii_odb10 dataset. We found that 96.30% of BUSCO genes were completely detected in the final assembly.
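The two summary statistics reported here, N50 and the anchoring rate, can be computed as in the following sketch. The function names and the toy sequence lengths are hypothetical and serve only to show the arithmetic.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembly length."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


def anchored_percent(chromosome_lengths, unplaced_lengths):
    """Percentage of assembled sequence placed on chromosomes."""
    anchored = sum(chromosome_lengths)
    total = anchored + sum(unplaced_lengths)
    return 100.0 * anchored / total


# Toy lengths for illustration only (not data from this study).
toy_contigs = [80, 70, 50, 40, 30, 20]
toy_n50 = n50(toy_contigs)
toy_anchored = anchored_percent([80, 70, 50], [40, 30, 20])
```

Because N50 depends only on the length distribution, joining contigs into chromosome-scale scaffolds raises it sharply even when the total assembly length barely changes, as seen in the increase from 8.10 Mb to 32.32 Mb.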

Repeat annotation

The repetitive elements in the Z. platypus genome were annotated using a combination of homology-based and ab initio approaches. For the homology-based approach, repeat sequences were identified with RepeatMasker v4.0.9 and RepeatProteinMasker v4.0.9 (http://www.repeatmasker.org/) using the Repbase database (http://www.girinst.org/repbase/). For the ab initio approach, RepeatModeler v1.0.11 (http://www.repeatmasker.org/RepeatModeler/) and LTR-FINDER v1.0.5 (ref. 42) were used to build an ab initio repeat sequence library, and RepeatMasker v4.0.9 was then used to predict repeat sequences. Furthermore, TRF v4.09 (ref. 43) was used to find tandem repeats in the genome. In total, 426.68 Mb of repetitive sequences were identified by combining the ab initio and homology-based approaches, accounting for 52.35% of the whole genome (Table 3). Of these, 403.00 Mb (49.45%) were transposable elements (TEs), including 259.97 Mb of DNA repeat elements (31.90%), 56.25 Mb of long interspersed nuclear elements (LINE, 6.90%), 6.20 Mb of short interspersed nuclear elements (SINE, 0.76%), 104.51 Mb of long terminal repeat elements (LTR, 12.82%), and 35.45 Mb of unknown elements (4.35%) (Table 4).
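When annotations from several approaches are combined, overlapping intervals must be counted once to obtain a non-redundant repeat length. The sketch below shows one way to do this by merging half-open intervals; it is an illustration under that assumption, not the procedure used in this study, and the coordinates are hypothetical. Overlap of this kind is a plausible reason why the per-class TE lengths listed above sum to more than the combined TE total.

```python
def merged_length(intervals):
    """Total length covered by half-open intervals (start, end),
    counting overlapping or adjacent regions only once."""
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Close the previous merged block and start a new one.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlaps or touches the current block: extend it.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


# Hypothetical coordinates on one sequence, for illustration only.
homology_hits = [(0, 10), (20, 30)]
ab_initio_hits = [(5, 15), (25, 28)]
naive_sum = sum(end - start for start, end in homology_hits + ab_initio_hits)
non_redundant = merged_length(homology_hits + ab_initio_hits)
```

In this toy case the naive sum is 33 bp, whereas the non-redundant coverage is 25 bp.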

Table 3 Summary of repetitive sequences.
Table 4 Statistics of repetitive sequence classification results.

Gene annotation

To obtain high-quality protein-coding gene models for the Z. platypus genome, a comprehensive strategy combining homology-based, transcript-based and de novo prediction was employed. For the homology-based prediction, protein sequences of Ancherythroculter nigrocauda (GCA_036281575.1), Danio rerio (GCA_000002035.4), Onychostoma macrolepis (GCA_012432095.1), Carassius auratus (GCA_003368295.1) and Opsariichthys bidens (GWHBEIO00000000) were downloaded from the Ensembl database (http://www.ensembl.org) and the NGDC database (https://ngdc.cncb.ac.cn/). These sequences were aligned to the Z. platypus genome using Exonerate44. Meanwhile, a total of 77.90 Gb of clean data was generated with Iso-Seq, and 32,860 transcripts with a mean length of 2,582 bp were obtained with the Iso-Seq workflow. For the transcript-based prediction, PASA45 was used to annotate gene structure with the full-length transcripts. For the de novo prediction, gene structure was identified with Augustus v3.3 (ref. 46) and GenScan v1.0 (ref. 47). All evidence was then integrated using MAKER2 (ref. 48). PASA was used to further refine the gene structure based on the transcriptome data. A total of 24,779 protein-coding genes were predicted, with an average gene length of 17,588.54 bp and an average of 9.41 exons per gene (Table 5).

Table 5 Statistics of gene prediction.

Gene function annotation was performed by aligning the predicted genes to several databases, including NCBI Nr, Swiss-Prot49, Pfam50, GO51, KEGG52, InterPro53, and TrEMBL54, using BLASTP (e-value ≤ 1e−5, max_target_seqs 1). In total, 22,823 genes, accounting for 92.11% of all predicted genes, were annotated in at least one database (Table 6). The annotated gene set contained 91.40% complete and 2.70% fragmented BUSCOs with actinopterygii_odb10, indicating that the annotation is highly complete.
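The filtering step described above can be illustrated with a short sketch that keeps the best hit per gene at e-value ≤ 1e−5 and computes the annotated fraction. It assumes the standard 12-column BLAST tabular layout (e-value in column 11, bit score in column 12); the output format actually used in this study is not stated, and the gene and protein identifiers below are hypothetical.

```python
def best_hits(lines, max_evalue=1e-5):
    """Keep the highest-scoring hit per query from BLAST tabular output.

    Assumes the standard 12-column layout: qseqid, sseqid, pident, length,
    mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore.
    """
    best = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue  # fails the e-value threshold
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


def annotated_percent(n_annotated, n_total):
    """Percentage of genes with at least one accepted hit."""
    return round(100.0 * n_annotated / n_total, 2)


# Hypothetical hits for illustration only (not data from this study).
toy_lines = [
    "gene1\tP1\t90.0\t100\t5\t0\t1\t100\t1\t100\t1e-50\t200",
    "gene1\tP2\t80.0\t100\t10\t0\t1\t100\t1\t100\t1e-20\t120",
    "gene2\tP3\t40.0\t50\t30\t0\t1\t50\t1\t50\t0.5\t30",
]
toy_best = best_hits(toy_lines)

# Reported counts: 22,823 annotated genes out of 24,779 predicted genes.
reported_percent = annotated_percent(22823, 24779)
```

Applied to the reported counts, the calculation reproduces the stated 92.11%.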

Table 6 Statistics of Zacco platypus genome annotation.

Finally, tRNAscan-SE55 and BLASTN were used to predict tRNA and rRNA sequences in the genome, respectively. Additionally, miRNA and snRNA sequences were identified by searching against Rfam56 with Infernal. In total, 668 microRNAs (miRNAs), 10,272 transfer RNAs (tRNAs), 1,332 ribosomal RNAs (rRNAs), and 534 small nuclear RNAs (snRNAs) were identified in the genome (Table 7).

Table 7 Statistics of noncoding RNA annotation result.

Data Records

All raw sequencing data used in this study were submitted to the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) under BioProject accession number PRJNA1028840. Specifically, the Illumina WGS data were archived under accession number SRR26456191 (ref. 57), and the PacBio WGS data were deposited under accession number SRR26456189 (ref. 58). The Iso-Seq and Hi-C data sets were archived under accession numbers SRR26456188 (ref. 59) and SRR26456190 (ref. 60), respectively. The final chromosome assembly has been deposited at GenBank under accession number JAYDZZ000000000 (ref. 61). The genome annotation file has been deposited at Figshare62.

Technical Validation

The per-base quality scores and GC content of the Illumina raw sequencing data were inspected with FastQC v0.11.9 (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). BUSCO v3.0.2 was used with the actinopterygii_odb10 dataset (ref. 41) to quantitatively assess the completeness of both the genome assembly and the protein-coding gene annotation.