A chromosome-level genome assembly of East Asia endemic minnow Zacco platypus

Zacco platypus is a colorful freshwater minnow endemic to East Asia, where it is widely distributed. In this study, two adult female individuals collected from the Haihe River basin were used for karyotyping and genome sequencing, respectively. The karyotype formula of Z. platypus is 2n = 48 = 18 M + 24 SM/ST + 6 T. We used PacBio long-read sequencing and Hi-C technology to assemble a chromosome-level genome of Z. platypus. An 814.87 Mb genome was assembled from the PacBio long reads, and 98.64% of the assembled sequences were subsequently anchored to 24 chromosomes based on the Hi-C data. The chromosome-level assembly contained 54 scaffolds with an N50 length of 32.32 Mb. Repeat elements accounted for 52.35% of the genome, and 24,779 protein-coding genes were predicted, of which 92.11% were functionally annotated using public databases. BUSCO analysis yielded a completeness score of 96.5%. This high-quality genome assembly provides a valuable resource for future functional genomic research, comparative genomics, and evolutionary studies of the genus Zacco.


Background & Summary
Zacco platypus is one of the endemic colorful minnows that are widespread in the freshwater ecosystems of East Asia [1]. It is often used as a test model and indicator species to assess contamination of aquatic environments in North Korea, South Korea and China [2][3][4][5]. Recently, owing to its distinctive nuptial characteristics (sexual dimorphism and dichromatism, and the elongated anal fin and nuptial tubercles of the male), Z. platypus has become an important emerging native ornamental fish in China.
Z. platypus has undergone a long and complex taxonomic history. As the type species of the genus Zacco, Z. platypus was first described from Nagasaki, Japan [6]. It was successively placed in Cyprinidae, Leuciscinae, Zacco [7] and Cyprinidae, Danioninae, Zacco [8]. After a series of revisions [9][10][11][12], it is currently assigned to Xenocyprididae, Opsariichthyinae, Zacco [13]. The genus Zacco was established in 1902, and the criterion discriminating Zacco from Opsariichthys was that the peculiar notched jaws present in Opsariichthys are absent in Zacco [14]. After a series of taxonomic studies [15][16][17][18][19], the diagnostic features of Zacco were modified to: (1) the nuptial tubercles on the cheeks of the male are united basally to form a plate, as noted by Jordan and Hubb [20]; and (2) the light green lateral crossbars are fused into fewer large patches, which allows it to be clearly separated from members of Opsariichthys [21]. A series of population genetic studies using mitochondrial cytochrome b (Cytb) fragments and intron polymorphisms revealed that Chinese Z. platypus contains multiple molecular lineages [22][23][24]. Morphological comparisons and genetic analyses using AFLP markers indicated that O. evolans, which had been proposed as a synonym of Z. platypus, is a valid species [25]. Notably, both O. evolans and O. acutipinnis had been reported as synonyms of Z. platypus [8]. Therefore, the molecular lineages from the upper-middle Yangtze and Pearl River basins should be regarded as members of the O. acutipinnis-O. evolans complex. Consequently, it was once thought that the genus Zacco should be restricted to its type species, Z. platypus, which may be distributed only from Japan to the north of Zhejiang Province, China [26].
In the last few years, new opinions on the taxonomy of Zacco have emerged. Molecular analysis based on three nuclear genes suggested that Z. acanthogenys might be a valid species, although no comprehensive diagnostic feature other than the red upper iris has been reported [27]. More recently, Z. sinensis sp. nov. and Z. tiaoxiensi sp. nov. were described using morphological and mitochondrial data [28][29]. Owing to its wide distribution, Z. platypus exhibits considerable morphological variation. Our field surveys found that body size and color pattern vary among river basins in China, and even among drainages within the same river basin. A limited set of nuclear genes or mitochondrial markers represents only a small percentage of the genome, and mitochondrial markers are of maternal origin, which may bias systematic and taxonomic conclusions [30]. Thus, the taxonomy of Zacco remains under debate. To facilitate taxonomic and phylogenetic studies of Zacco fishes, genome-wide genetic information is urgently needed. Although Xu et al. have reported the whole genome of Z. platypus [31], a chromosome-level genome assembly of this species is still unavailable. Here, we assembled a high-quality chromosome-level genome of the East Asia endemic minnow Z. platypus. This new assembly will greatly benefit systematic and taxonomic studies of the genus Zacco. Furthermore, access to the genomic data set will facilitate the use of Z. platypus as an indicator organism for assessing contamination of aquatic environments.

Methods
Sample collection and genome sequencing.
A healthy female Z. platypus was collected from Xingtai City, Hebei Province, China (37.0750°N, 113.9221°E). High-quality genomic DNA was extracted from muscle tissue, and library construction and sequencing were carried out at Frasergen Co., Ltd. (Wuhan, China). For short-read sequencing, the Illumina Hiseq X-10 platform (Illumina, San Diego, CA, USA) was used to perform paired-end sequencing with an insert size of 300-350 base pairs (bp). For long-read DNA sequencing, PacBio sequencing was performed on a PacBio Sequel II platform in continuous long-read (CLR) mode.
To anchor scaffolds onto chromosomes, a chromosome conformation capture (Hi-C) library was prepared using muscle tissue. The Hi-C library was constructed following the standard protocol described previously [32], and sequenced on an Illumina Hiseq X-10 platform (Illumina, San Diego, CA, USA).
In addition, total RNA was extracted from muscle, blood, brain, liver, and spleen tissues for Iso-Seq using the Qiagen RNeasy Mini Kit (Qiagen, Hilden, Germany). The RNA samples from the 5 tissues were mixed in equal amounts. An Iso-Seq cDNA library was constructed according to the PacBio standard protocol with the BluePippin size selection system (Sage Science, MA, USA) and sequenced on the PacBio Sequel II platform.

Karyotypic analysis.
An adult female Z. platypus individual collected from the same location as the sequenced individual was used for karyotyping, following a published pipeline [33]. Chromosomes were photographed using a Leica DM4 B fluorescence microscope (Leica, Wetzlar, Germany). Chromosomes were classified according to the standardized nomenclature [34]. The results showed that Z. platypus has a chromosome number of 2n = 48 and a karyotype formula of 18 M + 24 SM/ST + 6 T (Fig. 1A).
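As a quick consistency check on the karyotype formula, the class counts can be summed to recover the diploid number and, from it, the haploid number expected in a chromosome-level assembly. The snippet below is only an illustrative sketch; the variable names are ours and are not part of the published pipeline.

```python
# Chromosome counts per morphological class, taken from the reported
# karyotype formula 18 M + 24 SM/ST + 6 T.
karyotype = {"M": 18, "SM/ST": 24, "T": 6}

diploid_number = sum(karyotype.values())  # 2n
haploid_number = diploid_number // 2      # n, the expected number of assembled chromosomes

print(f"2n = {diploid_number}, n = {haploid_number}")
```

The haploid number obtained this way matches the 24 chromosomes to which the Hi-C data anchor the assembly.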

Genome assembly.
The Illumina sequencing produced 84.32 Gb of clean data after quality control (Table 1). The genome size, repeat content and heterozygosity were estimated by K-mer analysis of the Illumina short reads. Frequencies of K-mers (K = 17) were counted using Jellyfish v2.2.6 [35]. The genome size was estimated to be approximately 818.15 Mb, with a heterozygosity of 0.37% and 47.72% of repeat sequences. Then, the genome assembly was conducted with the obtained 172.05 Gb of PacBio data using the Falcon assembler v0.3 (Table 1). The draft genome was further polished with gcpp v2.0.2 (https://github.com/PacificBiosciences/gcpp) and pilon v1.22 [36] to improve the quality of the genome assembly. This preliminary assembly of the Z. platypus genome was 814.87 Mb in length with an N50 of 8.10 Mb (Table 2).
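The idea behind K-mer based genome size estimation can be sketched as follows: the total number of K-mers divided by the depth of the main peak of the K-mer frequency histogram approximates the genome size. This is a minimal illustration, not the exact procedure used in this study; the function name, the `min_depth` error cutoff and the toy histogram are all assumptions made for the example.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size from a k-mer frequency histogram.

    histogram maps k-mer depth -> number of distinct k-mers seen at that depth.
    min_depth is a hypothetical cutoff below which k-mers are treated as errors.
    """
    # Drop low-depth k-mers, which are dominated by sequencing errors.
    usable = {depth: count for depth, count in histogram.items() if depth >= min_depth}
    # Take the most populated depth as the main (homozygous) peak.
    peak_depth = max(usable, key=usable.get)
    # Total number of k-mer occurrences among the retained k-mers.
    total_kmers = sum(depth * count for depth, count in usable.items())
    return total_kmers / peak_depth, peak_depth


# Toy histogram (not real data): an error tail at depths 1-3 and a peak at depth 10.
toy_histogram = {1: 1000, 2: 200, 3: 50, 8: 10, 9: 30, 10: 60, 11: 30, 12: 10, 20: 5}
size, peak = estimate_genome_size(toy_histogram)
print(size, peak)
```

In a highly heterozygous genome the tallest peak may be the heterozygous one, so the peak should be checked against the histogram plot before the estimate is trusted.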
Subsequently, 151.17 Gb of Hi-C data were aligned to the assembly using Juicer v1.6.2 [37] (Table 1). The contigs were ordered and anchored with the Hi-C data using 3D-DNA [38] and manually adjusted using Juicebox Assembly Tools v1.11.08 [39]. Finally, the Hi-C interaction heatmap demonstrated the excellent quality of the genome assembly (Fig. 1B). Approximately 98.64% of the contig sequences were anchored to 24 chromosomes, which is consistent with the karyotype analysis in this study (Fig. 1A). Circos [40] was used to visualize the 24 chromosomes, GC content, gene density, repetitive sequence density and major interchromosomal syntenic relationships (Fig. 1C). The longest and shortest chromosomes were 46.87 Mb and 25.28 Mb in length, respectively (Table 2). The N50 reached 32.32 Mb for the final genome assembly (Table 2). The assembly completeness was evaluated by Benchmarking Universal Single-Copy Orthologs (BUSCO) v3.0.2 [41] with actinopterygii_odb10. We found that 96.30% of BUSCO genes were completely detected in the final assembly.
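For readers unfamiliar with the N50 statistic reported here, it is the length of the sequence at which the cumulative length of sequences sorted from longest to shortest first reaches half of the total assembly length. A minimal sketch follows; the function name and the toy lengths are illustrative only and do not come from the assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence downwards until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("n50() requires at least one sequence length")


# Toy example: total length 30, half is 15; 10 + 8 = 18 first reaches it.
toy_lengths = [10, 8, 6, 4, 2]
print(n50(toy_lengths))
```

Because scaffolding joins contigs into longer sequences, the same statistic rises from 8.10 Mb for the preliminary assembly to 32.32 Mb after Hi-C anchoring.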

Repeat annotation.
The repetitive elements in the genome of Z. platypus were annotated using a combination of homology-based and ab initio approaches. For the homology-based approach, the repeat sequences were identified with RepeatMasker v4.0.9 and RepeatProteinMasker v4.0.9 (http://www.repeatmasker.org/) using the Repbase database (http://www.girinst.org/repbase/). For the ab initio approach, RepeatModeler v1.0.11 (http://www.repeatmasker.org/RepeatModeler/) and LTR-FINDER software v1.0.5 [42] were used to build an ab initio repeat sequence library, and then RepeatMasker v4.0.9 was used to predict repeat sequences. Furthermore, TRF v4.09 [43] was used to find tandem repeats in the genome. Finally, a total of 426.68 Mb of repetitive sequences were identified by combining the de novo and homology-based approaches, accounting for 52.35% of the whole genome (Table 3

Gene annotation.
To obtain high-quality protein-coding genes for the Z. platypus genome, a comprehensive strategy combining homology-based prediction, transcript-based prediction and de novo prediction was employed. For the homology-based prediction, protein sequences from Ancherythroculter nigrocauda (GCA_036281575.1), Danio rerio (GCA_000002035.4), Onychostoma macrolepis (GCA_012432095.1), Carassius auratus (GCA_003368295.1) and O. bidens (GWHBEIO00000000) were downloaded from the Ensembl database (http://www.ensembl.org) and the NGDC database (https://ngdc.cncb.ac.cn/). These sequences were aligned to the Z. platypus genome using Exonerate software [44]. Meanwhile, a total of 77.90 Gb of clean data was generated with Iso-Seq, and 32,860 transcripts with a mean length of 2582 bp were obtained with the Iso-Seq workflow. For the transcript-based prediction, PASA [45] was used to annotate gene structure with the full-length transcripts. For the de novo prediction, gene structure was identified with Augustus v3.3 [46] and GenScan v1.0 [47]. All data were then integrated using MAKER2 [48]. PASA was used to further refine the gene structure based on the transcriptome data, and a total of 24,779 protein-coding genes were predicted, with an average gene length of 17,588.54 bp and an average of 9.41 exons per gene (Table 5). Gene function annotation was performed by aligning the genes to several databases, including NCBI Nr, Swiss-Prot [49], Pfam [50], GO [51], KEGG [52], InterPro [53], and TrEMBL [54], using BLASTP (e-value ≤ 1e-5, max_target_seqs 1). Finally, 22,823 genes, accounting for 92.11% of the total, were successfully annotated with at least one database (Table 6). The annotated genes contained 91.40% complete and 2.70% fragmented BUSCOs using actinopterygii_odb10, indicating that the annotation has high completeness.
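The e-value filtering step and the resulting annotation rate can be illustrated with a short sketch. It assumes BLASTP results in the standard 12-column tabular layout (query, subject, identity, ..., e-value, bit score); the function name and the toy hits are hypothetical and are not taken from the annotation pipeline used here.

```python
def best_hits(lines, max_evalue=1e-5):
    """Keep, for each query, the highest-scoring hit with e-value <= max_evalue.

    Assumes 12-column tabular BLAST output: e-value in column 11, bit score in column 12.
    """
    best = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue  # hit is not significant enough
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Toy hits (not real data): gene1 has two significant hits, gene2 has none.
toy_hits = [
    "gene1\tP1\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200",
    "gene1\tP2\t80.0\t100\t20\t0\t1\t100\t1\t100\t1e-20\t120",
    "gene2\tP3\t40.0\t50\t30\t0\t1\t50\t1\t50\t0.5\t30",
]
hits = best_hits(toy_hits)
print(hits)

# Annotation rate from the counts reported in the text.
annotation_rate = round(22823 / 24779 * 100, 2)
print(annotation_rate)
```

A gene counts as annotated when it retains at least one such hit in any of the databases, which is how 22,823 of 24,779 genes gives the reported 92.11%.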

Data Records
All the raw sequencing data used in this study were submitted to the National Center for Biotechnology Information (NCBI) SRA (Sequence Read Archive) database under BioProject accession number PRJNA1028840. Specifically, the Illumina WGS data were archived under the accession number SRR26456191 [57], while the PacBio WGS data were deposited under the accession number SRR26456189 [58]. The Iso-Seq and Hi-C data sets were archived under the accession numbers SRR26456188 [59] and SRR26456190 [60], respectively. The final chromosome assembly has been deposited at GenBank under the accession number JAYDZZ000000000 [61]. The genome annotation file has been deposited at Figshare [62].

Technical Validation
The per-base quality scores and GC content of the Illumina raw sequencing data were inspected with FastQC v0.11.9 (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). BUSCO v3.0.2 was used to quantitatively assess the genome assembly and to evaluate the completeness of the protein-coding gene annotation with the actinopterygii_odb10 dataset [41].
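BUSCO assigns every ortholog in the reference set to one of three categories (complete, fragmented or missing), so the percentages sum to 100 and the missing fraction follows from the other two. The short sketch below applies this to the annotation-level figures reported in the gene annotation section; the variable names are ours.

```python
# Annotation-level BUSCO percentages reported for the predicted gene set.
complete_pct = 91.40
fragmented_pct = 2.70

# The three BUSCO categories partition the reference set, so they sum to 100%.
missing_pct = round(100 - complete_pct - fragmented_pct, 2)
print(f"Missing BUSCOs: {missing_pct}%")
```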

Fig. 1. Karyotype and genomic information visualization of Zacco platypus. (A) The image and karyotype of Z. platypus. (B) Heat map of interaction intensity between chromosome sequences of Z. platypus anchored by Hi-C. (C) Circos plot of the 24 assembled chromosomes of the Z. platypus genome. From the outside to the inside, the tracks indicate the 24 chromosomes, GC content (bin = 1 Mb), gene density (bin = 1 Mb), repetitive sequence density (bin = 1 Mb) and the major interchromosomal syntenic relationships, respectively.

Table 1. Statistics of sequencing data.

Table 3. Summary of repetitive sequences.

Table 4. Statistics of repetitive sequence classification results. Note: TEs, transposable elements; LINE, long interspersed nuclear elements; SINE, short interspersed nuclear elements; LTR, long terminal repeats.

Table 5. Statistics of gene prediction.

Table 7. Statistics of noncoding RNA annotation results.