The sequence and de novo assembly of Takifugu bimaculatus genome using PacBio and Hi-C technologies

Takifugu bimaculatus is a native teleost species of the southeast coast of China where it has been cultivated as an important edible fish in the last decade. Genetic breeding programs, which have been recently initiated for improving the aquaculture performance of T. bimaculatus, urgently require a high-quality reference genome to facilitate genome selection and related genetic studies. To address this need, we produced a chromosome-level reference genome of T. bimaculatus using the PacBio single molecule sequencing technique (SMRT) and High-through chromosome conformation capture (Hi-C) technologies. The genome was assembled into 2,193 contigs with a total length of 404.21 Mb and a contig N50 length of 1.31 Mb. After chromosome-level scaffolding, 22 chromosomes with a total length of 371.68 Mb were constructed. Moreover, a total of 21,117 protein-coding genes and 3,471 ncRNAs were annotated in the reference genome. The highly accurate, chromosome-level reference genome of T. bimaculatus provides an essential genome resource for not only the genome-scale selective breeding of T. bimaculatus but also the exploration of the evolutionary basis of the speciation and local adaptation of the Takifugu genus.

Library construction and sequencing. A genome survey was performed based on Illumina short reads for estimating genome size, heterozygosity and repeat content, which provides a basic evaluation before we started the large scale whole genome sequencing. A library with a 350 bp insert size was constructed from NMW gDNA following the standard protocol provided by Illumina (San Diego, CA, USA). The library was then sequenced with a paired-end sequencing strategy using the Illumina HiSeq 2500 platform, and the read length was 2 × 150 bp. Finally, ~53.43 Gb raw data were generated. After removing the low-quality bases and paired reads with the Illumina adaptor sequence using SolexaQA++ 6 (version v.3.1.7.1), a total of ~53.28 Gb clean reads, were retained for the genome survey (Table 1).
For the preparation of the single-molecule real-time (SMRT) DNA template, the HMW gDNA was sheared into large fragments (10 K bp on average) by ultrasonication and then end-repaired according to the manufacturer's instructions (Pacific Biosciences). The blunt hairpins and sequencing adaptor were ligated to the DNA fragments, DNA sequencing polymerases were bound to the SMRTbell templates. Finally, the library was quantified using a Qubit 4 Fluorometer (Invitrogen, USA). After sequencing with the PacBio SEQUEL platform at Novogene (Tianjin), a total of 3.86 Million (~28.97 Gb) long reads were generated and used for the following genome assembly. The average and N50 length of the subreads sequences were 7,505 bp and 12,513 bp, respectively. According to the genome survey, the genome size of T. bimaculatus was estimated to be 393.15 Mb; therefore, the average sequencing coverage was 73.69× (Table 1).
For Hi-C sequencing, the Mbol restriction enzyme was used to digest the HMW gDNA after fixing the conformation of HMW gDNA by formaldehyde, after which the 5′ overhangs were repaired with biotinylated residues. The isolated DNA was reverse-crosslinked, purified and filtered for biotin-containing fragments after blunt-end ligation in situ. Thereafter, the DNA was sheared into fragments by ultrasonication and subsequently repaired by T4 DNA polymerase, T4 polynucleotide kinase and Klenow DNA polymerase. Then, dATP was attached to the 3′ ends of the end-repaired DNA, and 300-500 bp fragments were retrieved by Caliper LabChip Xte (PerkinElmer, USA). The DNA concentration was quantified by a Qubit 4 Fluorometer, and the Illumina Paired-End adapters were ligated to the DNA by T4 DNA Ligase. The 12-cycle PCR products were purified by AMPureXP beads. Finally, sequencing of the Hi-C library was performed on an Illumina HiSeq 2500 platform and yielded a total of 128.64 Gb paired-end raw reads, with an average sequencing coverage of 117.80X (Table 1).
The cDNA library was prepared following the protocols of the Illumina TruSeq RNA Sample Preparation Kit (Illumina, San Diego, CA, USA) and quantitated with KAPA Library Quantification Kits. Then, sequencing of RNA-seq was performed on an Illumina HiSeq 2500 platform with a 150 bp paired-end strategy. Finally, we generated 21.35 Gb paired-end raw reads and 20.95 Gb paired-end clean reads for gene structure annotation (Table 1). www.nature.com/scientificdata www.nature.com/scientificdata/ de novo assembly of the T. bimaculatus genome. Reads from the three types of libraries were used in different assembly stages separately (Fig. 1). Illumina sequencing data, PacBio sequencing and Hi-C reads were used for the genome survey, contig assembly and chromosome-level scaffolding, respectively.
In the genome survey, paired reads with "N" sites exceeding 8 or low-quality (Q < 5) bases exceeding 60 were filtered out from the Illumina library. The pair reads containing the Illumina adaptor sequence were also filtered. Using Jellyfish 7 , the frequency of 17-mers in the Illumina clean data was calculated with a 1 bp sliding window using the established method 8 and obeyed the theoretical Poisson distribution (Fig. S2). Finally, the proportion of heterozygosity in the T. bimaculatus genome was evaluated as 0.55%, and the genome size was estimated as 393.15 Mb, with a repeat content of 25.29% (Table S2).
Long reads generated from the PacBio SEQUEL platform were subsequently processed by a self-correction of errors using FALCON 9 . Based on the Overlap-Layout-Consensus algorithm, we detected overlaps from input reads and assembled the final String Graph 10 . Subsequently, we used the FALCON-unzip pipeline to generate phased contig sequences for further calling highly accurate consensus sequences using variantCaller in the GenomicConsensus package, which was employed as an arrow algorithm, and contigs were polished using Illumina reads by Pilon 11 . Finally, we obtained the assembled genome of T. bimaculatus, which contained including 2,193 contigs with a total length and contig N50 length of 404.21 Mb and 1.31 Mb, respectively ( Table 2).
For chromosome-level scaffolding, we first filtered Hi-C reads with the same protocol as Illumina reads. Subsequently, we mapped the Hi-C clean reads to the de novo assembled contigs by using BWA 12 (version 0.7.17) with the default parameters. We removed the reads that did not map within 500 bp of a restriction enzyme site. Using LACHESIS 13 (version 2e27abb), we assembled chromosome-level scaffolding based on the genomic proximity signal in the Hi-C data sets. In this stage, all parameters were default except for CLUSTER_N, ORDER_MIN_N_RES_ IN_SHREDS and CLUSTER_MIN_RE_SITES, which set as 22, 10 and 80, respectively. As a result, we generated 22 chromosome-level scaffolds containing 1,242 contigs (56.63% of all contigs) with a total length of 371.68 Mb (91.95% of the total length of all contigs), and the lengths of chromosomes ranged from 10.38 Mb to 28.86 Mb (Table 3).
Repeat sequences and gene annotation. We identified repeat sequences in the T. bimaculatus genome with a combination of homology-based and de novo approaches using previously established protocol 14 . For the homology-based approach, we used Tandem Repeats Finder 15 (version 4.04) to detect tandem repeats and used RepeatModeler 16 (version 3.2.9), LTR_FINDER 17 (version 1.0.2) and RepeatScout 18 (version 1.0.2) synchronously to detect repeat sequences in the T. bimaculatus genome. Combined with Repbase 19 (Release 19.06), a repeat sequence library was constructed with these results using USEARCH 20 (version 10.0.240). Then, we used RepeatMasker 16 (version 3.2.9) to annotate repeat elements based on this library. In another approach, we utilized Repbase 19 and a Perl script included in the RepeatProteinMasker (submodule in Repeatmasker) program with www.nature.com/scientificdata www.nature.com/scientificdata/ default parameters to detect TE proteins in the T. bimaculatus genome. Finally, after removing redundancies, we combined all the results generated by these methods, and a total of 109.92 Mb (27.2% in the T. bimaculatus genome) sequences were identified as repeat elements (Table 4). Among these repeat elements, long interspersed nuclear elements (LINEs) were the main type, accounting for 12.31% (49.76 Mb). In addition, regarding other repeat elements, there were 24.46 Mb (6.05%) of DNA transposons, 1.19 Mb (0.29%) of short interspersed nuclear elements (SINEs) and 31.55 Mb (7.8%) of long terminal repeats (LTRs) (Figs 2a and 3a Table 4).
For gene structure prediction, we used both homology-based and de novo strategies to predict genes in the T. bimaculatus genome. For homology-based prediction, we mapped the protein sequences of Oryzias latipes 21 , Gasterosteus aculeatus 22 , Tetraodon nigroviridis 23 , Takifugu rubripes 24 and Oreochromis niloticus 25 onto the generated assembly using BLAT 26 (version 35) with an e-value ≤ 1e-5. Then, we used GeneWise 27 (version 2.2.0) to align the homologous in the T. bimaculatus genome against the other five teleosts for gene structure prediction. In the de novo approach, we used several software packages, including Augustus 28     www.nature.com/scientificdata www.nature.com/scientificdata/ gene( Fig. 3b and Table 5). For the annotation of candidate non-coding RNA (ncRNA), we used BLASTN 37 to align the T. bimaculatus genome against the Rfam database 38 (version 12.0). As a result, we annotated 1,666 miRNA, 753 tRNA, 928 rRNA and 1162 snRNA genes ( Fig. 2a and Table 4).
For gene function annotation, we used BLASTP to align the candidate sequences to the NCBI and Swissport protein databases with E values < 1 × 10 −5 . Then, we performed the functional classification of GO categories with the InterProScan program 39 (version 5.26) and used KEGG Automatic Annotation Server (KAAS) 40 to conduct the KEGG pathway annotation analysis. A total of 21,098 genes were successfully annotated, accounting for 99.9% of all predicted genes (Figs 2a, 3c and Table 5).

Data Records
The raw sequencing reads of all libraries are available from NCBI via the accession numbers SRR8285219-SRR8285227 41 . The assembled genome and sequence annotations are available in NCBI with the accession number SWLE00000000 via the project PRJNA508537 42 .

technical Validation
Evaluating the completeness of the genome assembly and annotation. The final assembly contains 404.41 Mb with a scaffold N50 size of 16.79 Mb ( Table 2). Assembly completeness and accuracy were evaluated by multiple methods. First, reads from the short-insert library were re-mapped onto the assembled genome using BWA 12 (version 0.7.17). A total of 96.97% of the reads mapped to a reference sequence in the genome (98.71% coverage), demonstrating a high assembly accuracy (Table S3). We used Genome Analysis Toolkit 43 (GATK) (version 4.0.2.1) to identify a total of 1,115.45 SNPs throughout the whole genome, including 1,110.69 K heterozygous SNPs and 4,765 homozygous SNPs (Table S4). In addition, the accuracy of the assembly was verified by the extremely low proportion of homozygous SNPs (1.22 × 10 −5 %) (Table S4).
Assembly completeness was evaluated using Core Eukaryotic Genes Mapping Approach (CEGMA) software 44 (version 2.3), and a total of 235 core Eukaryotic Genes (CEGs) from the complete set of 248 CEGs (94.67%) were identified in the assembled genome, suggesting the draft genome of T. bimaculatus was high complete (Table S4). Finally, Benchmarking Universal Single-Copy Orthologues (BUSCO) software 45 (version 1.22) was used to evaluate the completeness of the assembly with the actinopterygii_odb9 database. A total of 4,254 out of the 4,584 searched BUSCO groups (92.8%) had been completely assembled in our draft genome, suggesting a high level of completeness of the de novo assembly (Table S3).
To verify the accuracy of the contig arrangement in 22 chromosomes, we aligned 7,443 (count) 1 K bp small fragments with 50 K bp spacing as anchors of the assembled genome against the published T. rubripes genome (FUGU5) 24,46 to compare consistency between these two genomes. The 22 chromosomes we identified in the T. bimaculatus genome aligned exactly against the chromosomes of the T. rubripes, suggesting high continuity with the T. rubripes genome (Fig. 2b).  Table 4. Classification of repeat elements and ncRNAs in T. bimaculatus genome. Note: "Denovo" represented the de novo identified transposable elements using RepeatMasker, RepeatModeler, RepeatScout, and LTR_FINDER. "TE protein" meant the homologous of transposable elements in Repbase identified with RepeatProteinMask. While "Combined TEs" referred to the combined result of transposable elements identified in the two ways. "Unknown" represented transposable elements could not be classified by RepeatMasker.
www.nature.com/scientificdata www.nature.com/scientificdata/ The predicted gene models we used were integrated by EvidenceModeler, and a total of 18,706 genes were predicted by all three gene structure prediction strategies, which representing 88.58% of the 21,117 predicted genes (Fig. 3b). Notably, this validation procedure is limited by the gene expression in the mixture of tissues used for RNA-Seq. Therefore, considering that transcriptomic data derived from different tissues will cover distinct sets of expressed genes, it is conceivable that more genes could be validated.

Gene family identification and phylogenetic analysis of T. bimaculatus.
To identify gene families among T. bimaculatus and other species, we download the protein sequence of Branchiostoma belcheri 47 (outgroup), Ciona intestinalis 48 (outgroup), Danio rerio 49 , Gadus morhua 50 , Gasterosteus aculeatus 22 , Latimeria chalumnae 51 , Lepisoteus oculatus 52 , Mola mola 53 , Oryzias latipes 21 , Oreochromis niloticus 25 , Takifugu rubripes 24 and Tetraodon nigroviridis 23 . We removed those protein sequences shorter than 30 amino acids in the proteome set of the above thirteen species and used OrthoMCL 54 to construct gene families. A total of 20,741 OrthoMCL families were built using the previously all-to-all BLASTP strategy 55 .
To reveal the phylogenetic relationships among T. bimaculatus and other species, we identified 1,479 single copy ortholog families from the 13 species (as described above) (Table S5) and aligned the protein sequences of these 1,497 orthologues using MUSCLE (version 3.8.31) 56 . Then we used Gblocks 57 to extract the well-aligned regions of each gene family alignment and converted protein alignments to the corresponding coding DNA sequence alignments using an in-house script. For each species, we combined all translated coding DNA sequences to a "supergene". Finally, we used RAxML (version 8.2.12) 58 with 500 bootstrap replicates to generate trees. Using molecular clock data from the TimeTree database 59 , MCMCTREE (PAML package) 60 were employed to estimate the divergence time based on the approximate likelihood calculation method. The phylogenetic relationships among the other fish species were consistent with several previous studies 8,14,61 . Based on the phylogenetic analysis, we inferred that T. bimaculatus speciated approximately 9.1 million years ago from the common ancestor of Takifugu (Fig. 4).

Code availability
The versions, settings and parameters of the software used in this work are as follows: Genome assembly:     Table 5. Gene structure and function annotation in T. bimaculatus genome.