Chromosome-level assembly of Gymnocypris eckloni genome

Gymnocypris eckloni is widely distributed in isolated lakes and the upper reaches of the Yellow River and play significant roles in the trophic web of freshwater communities. In this study, we generated a chromosome-level genome of G. eckloni using PacBio, Illumina and Hi-C sequencing data. The genome consists of 23 pseudo-chromosomes that contain 918.68 Mb of sequence, with a scaffold N50 length of 43.54 Mb. In total, 23,157 genes were annotated, representing 94.80% of the total predicted protein-coding genes. The phylogenetic analysis showed that G. eckloni was most closely related to C. carpio with an estimated divergence time of ~34.8 million years ago. For G. eckloni, we identified a high-quality genome at the chromosome level. This genome will serve as a valuable genomic resource for future research on the evolution and ecology of the schizothoracine fish in the Qinghai-Tibetan Plateau.

evolution of G. eckloni may shed light on the underlying molecular mechanisms involved in high-altitude adaptations in schizothoracine fish of the QTP.
In the present study, we integrated PacBio long-read sequencing, Illumina short-read sequencing, and high-throughput chromosome conformation capture (Hi-C) technology to generate a high-quality chromosome-level reference genome for G. eckloni. The reference genome obtained in this study will provide a foundation for future investigations on the evolution and adaptation of schizothoracine fish.
Methods experimental fish and sequencing. G. eckloni genomic DNA were extracted from the muscle samples of a healthy female individuals obtained from the Native Fish Artificial Proliferation and Release Station, Xunhua, Qinghai Province, China (Fig. s1). For genome assembly, two libraries with insert sizes of 300 bp and 20 kb were separately constructed using an Illumina TruSeq Nano DNA Library Prep Kit and SMRT bell Template Prep Kit. The two libraries were subsequently sequenced using an Illumina HiSeq X Ten instrument and a PacBio Sequel platform 25 . For the PacBio platform, a total of 312.2 Gb PacBio long sequencing reads were generated, and 239.0 Gb subreads (334.6 × coverage) with an average length of 23,706 bp were obtained after removing adaptors in polymerase reads (Table 1). For the Illumina HiSeq X Ten sequencing platform, a total of 251.7 Gb short sequencing reads were generated. After filtering, 215.2 Gb (231.2 × coverage) of clean Illumina data were retained to perform a genome survey.
To conduct chromosome-level assembly of the G. eckloni genome, a Hi-C library was generated using the Mbo I restriction enzyme following previously described standard protocol with minor modifications 26 . In brief, the purified DNA from the fresh muscle sample was digested with Mbo I restriction enzyme and labelled by incubating with Biotin-14-dATP (Thermo Fisher Scientific, USA), and then ligated by T4 DNA Ligase. After incubating overnight to reverse crosslinks, the ligated DNA was sheared into 200-600 bp fragments, and then blunt-end repaired and A-tailed, followed by purification through biotin-streptavidin-mediated pull down. Finally, the Hi-C libraries were quantified and sequenced on the Illumina NovaSeq6000 platform (Illumina, USA) using a PE-150 module, generating a total of 257.3 Gb (275.8 × coverage) clean data after using the same filter criteria with short reads (Table 1).
To provide evidence of transcripts for genome structure annotation, we conducted RNA-seq for muscle, skin, gill, liver, gut, spleen, kidney, heart, eye and blood samples. RNA was extracted using Ambion MagMAX-96 total RNA isolation kit (Life Sciences, United States) for all samples, and DNase I treatment was performed to eliminate DNA contamination. After the quality assessment of the extracted RNAs using NanoPhotometer ® spectrophotometer (Implen, United States), RNA-seq libraries were constructed according to the protoco and were sequenced by Illumina HiSeq4000 in paired-end 150 bp mode, resulting in a total of 66.43 Gb clean transcriptome data (Table 1).
De novo assembly of G. eckloni genome. We used the k-mer method to survey the genomic features of the G. eckloni. The k-mer count histogram was obtained from Illumina paired-end sequencing data using Jellyfish v2.99 27 . Based on the total number of 169,021,371,761 17-mers and a peak 17-mer depth of 181, the genome size of G. eckloni was estimated to be 927.13 Mb, and the estimated heterozygosity rate was approximately 1.82% (Table s1).
The 239.0 Gb subreads from the PacBio Sequel platform were used for genome assembly using wtdbg2 28 followed by Quiver 29 and Pilon 30 polishing using the 215.2 Gb of Illumina HiSeq clean reads, which produced a 918.45 Mb genome assembly, consisting of 3,170 contigs with a contig N50 size of 4.19 Mb (Table 2).
Hi-C technology was applied to conduct the chromosome-level genome assembly of G. eckloni. Clean reads sequenced from the Hi-C library were aligned to the contig-level genome with an end-to-end algorithm implemented in Bowtie v2.3.5 according to the Hi-C-Pro strategy 31,32 . Juicer v1.6.2 and 3D de novo assembly (3D-DNA) pipelines were used to assemble the contigs into the chromosome-level genome 33,34 . Ultimately, the assembled sequences were further anchored and orientated onto 23 pseudo-chromosomes using Hi-C data. The 23 pseudo-chromosomes ranged in size from 15.91 to 89.39 Mb ( Fig. 1 and Table s2), covering ~98.52% of the whole genome. Finally, the G. eckloni genome was obtained with 711 scaffolds and a total length of 918,681,488 bp, a contig N50 of 4.19 Mb, and scaffold N50 of 43.54 Mb ( Table 2).
The completeness of the genome assembly was assessed by the single copy orthologs (BUSCO, version 5.3.2) 35 and CEGMA 36 software. The BUSCO analysis based on the actinopterygii_odb10 database showed that 87.5% (single-copy genes: 83.0%, duplicated genes: 4.5%) of the 3,640 single-copy genes were identified as complete, 1.3% were fragmented, and 11.2% were missing from the assembled genome. The CEGMA analysis revealed that 221 conserved genes (89.11% of the core eukaryotic genes) supported the completeness of the assembled genome. Illumina short reads were mapped to the assembled genome using BWA 37  www.nature.com/scientificdata www.nature.com/scientificdata/ evaluate completeness of the genome assembly. The results showed that 93.40% of the reads could be mapped, covering 96.34% of the assembled genome.
Repetitive element and non-coding gene annotation in the G. eckloni genome. A combined strategy using homology alignments and de novo searches to identify whole-genome repeats was applied in our repeat annotation pipeline. Tandem repeats were extracted using TRF (http://tandem.bu.edu/trf/trf.html) by ab initio prediction. For homolog prediction, Repbase (http://www.girinst.org /repbase) employing RepeatMasker (http://www.repeatmasker.org/) software and its in-house scripts (RepeatProteinMask) with default parameters was used to extract repeat regions. Additionally, ab initio prediction based on the de novo repetitive elements database was conducted by LTR_FINDER (http://tlife.fudan.edu.cn/ltr_finder/), RepeatScout (http:// www.repeatmasker.org/), and RepeatModeler (http://www.repeatmasker.org/RepeatModeler.html) with default parameters. Then, all repeat sequences with lengths > 100 bp and gap 'N' < 5% were used to construct the raw transposable element (TE) library. A custom library (a combination of Repbase and our de novo TE library, which was processed by uclust to yield a non-redundant library) was supplied to RepeatMasker for DNA-level repeat identification. The results showed revealed that 47.63% of the G. eckloni genome was annotated as repetitive elements (Table s3), of which LTRs were the most abundant with a total length of 356.79 Mb, accounting for 38.84% of the whole genome. SINEs were the rarest with a total length of 2.37 Mb and represented 0.26% of the whole genome (Table s4).
The tRNAs were predicted using tRNAscan-SE (http://lowelab.ucsc.edu/tRNAscan-SE/), and the rRNA sequences were predicted using BLAST. The results showed that a total of 12,157 tRNAs were predicted using tRNAscan-SE, and 1,780 rRNA genes were annotated using BLASTN tool with an E-value of 1E-10 32 against human rRNA sequence. Other ncRNAs, including miRNAs and snRNAs, were identified by searching against the Rfam database with default parameters using infernal software (http://infernal.janelia.org/) ( Table s5). annotation of protein-coding genes. Gene predictions were conducted through a combination of homology, de novo, and transcriptome-based prediction methods. For homology-based predictions, the protein sequences of seven fish species, including Oryzias latipes, Ctenopharyngodon idellus, Ictalurus punctatus, Cyprinus carpio, Takifugu rubripes, Danio rerio, and Astyanax mexicanus, were downloaded from Ensembl database (http:// asia.ensembl.org/index. html). Protein sequences were aligned to the genome using TblastN v2.2.26 with an e-value of 1e −5 38 . Then, matching proteins were aligned to homologous genome sequences for accurate spliced alignments using GeneWise v2.4.1 39 (referred to "Homolog" in Table 3), which was subsequently used to predict gene structure of each protein region. RNA-sequencing data derived from nine tissues and blood samples were assembled using Trinity v2.1.1 40 , and were aligned against the G. eckloni genome using Program to Assemble Spliced Alignment (PASA) 41 (referred to "PASA" in Table 3). To optimize genome annotation, RNA-seq reads from different tissues were aligned to G. eckloni genome fasta using TopHat package v2.0.11 with default parameters to identify exons region and splice positions 42 . The alignment results were then used as inputs for Cufflinks package v2.2.1 with default parameters for genome-based transcript assembly 43 (referred to "Cufflinks"in Table 3). Finally, EvidenceModeler v1.1.1 was used to combine the gene models into weighted consensus gene structures with masked repetitive elements 41 . Additionally, PASA was used to update the final gene models, thereby adding information of alternatively spliced sites and untranslated regions (UTR) (referred to "Pasa-update" in Table 3). Ultimately, a total of 24,430 protein-coding genes were predicted in the G. eckloni genome. The average transcript length was 16,219.34 bp with an average coding sequence (CDS) length of 1,536.71 bp. The average exon number per gene was 8.88 with an average exon length of 173.00 bp and average intron length of 1,862.69 bp ( Table 3). The statistics of gene models, including lengths of a gene, CDS, intron, and exon in G. eckloni were comparable to those for close-related species (Table s6 and Fig. 2).
Public biological function databases of NR, SwissProt 44 , InterPro 45 , and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases 46 were used for the functional annotation of protein-coding genes using BLASTX and BLASTN utilities 46 with an e-value threshold of 1e −5 . InterPro database was used to predict protein function based on the conserved protein domains by InterproScan tool 47 . A total of 23,157 genes (94.8%) were successfully annotated by at least one public database (Table s7 and Fig. 3).

Length
No.  www.nature.com/scientificdata www.nature.com/scientificdata/ evolutionary and comparative genomic analysis. To examine G. eckloni evolution, we used OthoMCL 48 to cluster its genes with those from 13 other vertebrates: Astyanax mexicanus, Ictalurus punctatus, Danio rerio, C. carpio, Ctenopharyngodon idella, Oreochromis niloticus, Oryzias latipes, Takifugu rubripes, Gallus gallus, Homo sapiens, Mus musculus, Xenopus tropicalis, and Petromyzon marinus. From these 14 species, we identified 597 one-to-one single-copy genes that were used to construct a maximum likelihood (ML) tree using RaxML with the GTRGAMMA model 49 . Divergence times between species were calculated using the MCMC tree program implemented by PAML package 50 . According to the time-calibrated phylogeny, the age of the most recent common ancestor (MRCA) of the teleost fish was estimated to be 211.8-254.1 million years ago. The G. eckloni with the closest relationship to C. carpio shared an MRCA at ~ 34.8 million years ago (Fig. 4).   www.nature.com/scientificdata www.nature.com/scientificdata/ A total of 24,619 gene families were identified among the 14 species (Table s8), of which 2,739 core gene families were shared by all 14 species and 856 gene families is unique for G. eckloni including 1,488 genes. Analysis of the expansion and contraction of the gene families revealed that there were 464 (1650 genes) expanded and 743 (192 genes) contracted gene families in G. eckloni when compared to its MRCA (Fig. 4). The expanded gene families included ABC transporters, Peroxisome, Herpes simplex virus 1 infection, Staphylococcus aureus infection, Axon guidance, Dorso-ventral axis formation, Pertussis, Legionellosis, Rap1 signaling pathway and so on, and the contracted gene families included Tight junction, Systemic lupus erythematosus, Pathogenic Escherichia coli infection, Gap junction, Alcoholism, Pertussis, Ascorbate and aldarate metabolism, NOD-like receptor signaling pathway and so on.

Data Records
All raw data of the whole genome have been deposited into the National Center for Biotechnology Information (NCBI) SRA database (Experiments for SRP377513) under BioProject accession number PRJNA835611 51    www.nature.com/scientificdata www.nature.com/scientificdata/ Technical Validation RNa integrity. The transcriptomes for nine tissues and blood from three fish individuals were sequenced.
Before constructing RNA-Seq libraries, RNA purity was analyzed with a NanoPhotometer Spectrophotometer (Implen, United States). The RNA concentration was quantified with a Qubit RNA Assay Kit in a Qubit 2.0 Fluorometer (Life Technologies, United States). RNA integrity was analyzed using a RNA Nano 6000 Assay Kit and an Agilent Bioanalyzer 2100 (Agilent Technologies, United States). The total amount of RNA, RNA integrity and rRNA ratio were used to estimate the quality, content and degradation level of RNA samples. In the present study, RNAs samples with a total RNA amount ≥ 10 μg, RNA integrity number ≥ 8, and rRNA ratio ≥ 1.5 were finally subjected to construct the sequencing library.
Comparative genomic analyses. The 48 . Only the longest predicted transcript per locus was retained. In the all-against-all BLASTP comparisons, a cutoff e-value of 1e −5 was used. The MCL inflation index was set to 1.5.
According to the divergence times and phylogenetic relationships, CAFÉ was used to analyze the expansion and constriction of gene families in the G. eckloni genome based on the gene families identified by OrthoMCL 55 . The phylogenetic tree topology and branch lengths were taken into account when inferring the significance of change in the gene family size of each branch. Enrichment analyses based on the Gene Ontology (GO) and KEGG annotations were performed to identify the functional implications of expanded and contracted genes (Fisher's exact test, adjusted p-value < 0.05).

Code availability
All software used in this work is in the public domain, with parameters being clearly described in Methods. If no detail parameters were mentioned for a software, default parameters were used as suggested by developer. Fig. 4 Phylogenetic tree based on single-copy genes from 14 species shows the estimated divergence time (blue numbers), topology and expansion (green numbers), and contraction (red numbers) of gene families.