The sequencing and de novo assembly of the Larimichthys crocea genome using PacBio and Hi-C technologies

Larimichthys crocea is an endemic marine fish in East Asia that belongs to Sciaenidae in Perciformes. L. crocea has now been recognized as an “iconic” marine fish species in China because not only is it a popular food fish in China, it is a representative victim of overfishing and still provides high value fish products supported by the modern large-scale mariculture industry. Here, we report a chromosome-level reference genome of L. crocea generated by employing the PacBio single molecule sequencing technique (SMRT) and high-throughput chromosome conformation capture (Hi-C) technologies. The genome sequences were assembled into 1,591 contigs with a total length of 723.86 Mb and a contig N50 length of 2.83 Mb. After chromosome-level scaffolding, 24 scaffolds were constructed with a total length of 668.67 Mb (92.48% of the total length). Genome annotation identified 23,657 protein-coding genes and 7262 ncRNAs. This highly accurate, chromosome-level reference genome of L. crocea provides an essential genome resource to support the development of genome-scale selective breeding and restocking strategies of L. crocea.

In this report, we provided chromosome-level reference genome sequences of L. crocea combining the PacBio single molecule sequencing technique (SMRT) and high-throughput chromosome conformation capture (Hi-C) technologies.
In addition, we also produced a chromosome-level reference genome of Takifugu bimaculatus 9 , which is also cultured as an important food fish in China, via almost the same approach. Both genomes were assembled with high quality, confirming the stability and suitability of this approach for marine fishes. The availability of a fully sequenced and annotated genome is essential to support basic genetic studies and will be helpful to develop genome-scale selective breeding strategies for these important mariculture species.

Methods
Sample collection, library construction and sequencing. A healthy female large yellow croaker belonging to the F1 generation of the "Fufa I" strain was collected from the State Key Laboratory of Large Yellow Croaker Breeding at Ningde, Fujian Province, China, and white muscle samples were collected. The muscle samples were immediately frozen in liquid nitrogen for 30 min and then stored at −80 °C. For high-molecular-weight (HMW) genomic DNA (gDNA) extraction, frozen samples were lysed in SDS digestion buffer with proteinase K. Then, the lysates were purified using AMPure XP beads (Beckman Coulter, High Wycombe, UK) to obtain HMW gDNA. Meanwhile, normal-molecular-weight (NMW) gDNA was extracted from the same samples using the DNeasy 96 Blood and Tissue Kit (Qiagen, Shanghai, China).
A whole-genome shotgun sequencing strategy was employed for genome size estimation and polishing of preliminary contigs. An Illumina library with 250 bp insert size was constructed from NMW gDNA using the standard protocol provided by Illumina (San Diego, CA, USA), and paired-end sequencing was performed using the Illumina HiSeq2500 platform with a read length of 2 × 150 bp. Finally, 105.23 Gb raw reads were generated. All reads containing adaptor sequences were discarded first. After that, uncertain bases (represented by "N") and low-quality bases (Q < 5) were trimmed from the remaining Illumina reads using SolexaQA ++ 10 (version v.3.1.7.1). After trimming, there was a total of 105.01 Gb reads longer than 30 bp remaining, and these were retained as clean reads and used in genome size estimation and preliminary contig polishing (Table 1).
HWM gDNA was used in DNA template preparation for sequencing on the PacBio System following the "Template Preparation and Sequencing Guide" provided by Pacific Biosciences (Menlo Park, CA, USA). The main steps were as follows: extracted DNA was first sheared into large fragments (10 Kbp on average) and then purified and concentrated using AMPure PB beads; DNA damage and ends induced in the shearing step were repaired; blunt hairpins were subsequently ligated to the repaired fragment ends; prior to sequencing, the primer was annealed to the SMRTbell template, and then, DNA polymerase was bound to the annealed templates; finally, DNA sequencing polymerases were bound to the primer-annealed SMRTbell templates.
After sequencing, a total of 9.45 K (80.61 Gbases) long reads were generated from the PacBio SEQUEL platform. The average length and N50 length of these reads were 8,530.75 bp and 12,624 bp, respectively. The genome size of L. crocea was estimated to be 708.47 Mbp using K-mer analysis, and the average sequencing coverage was estimated as 113.78X (Table 1).
Hi-C sequencing was performed parallel to the PacBio sequencing. We used formaldehyde to fix the conformation of the HMW gDNA. Then, the fixed DNA was sheared with MboI restriction enzyme. The 5′ overhangs induced in the shearing step were repaired using biotinylated residues. Following the ligation of blunt-end fragments in situ, the isolated DNA was reverse-crosslinked, purified, and filtered to remove biotin-containing fragments. Subsequently, DNA fragment end repair, adaptor ligation, and polymerase chain reaction (PCR) were performed successively. In the end, sequencing was performed on the Illumina HiSeq2500 platform and yielded a total of 119.15 Gb paired-end reads, with an average sequencing coverage of 168.18X (Table 1).
De novo assembly of the L. crocea genome. In summary, as shown in Fig. 1, reads generated from three different types of libraries were used in three different assembly stages separately: Illumina sequencing data were used in estimation of genome size and polishing of preliminary contigs; PacBio sequencing data were used for preliminary contig assembly; and Hi-C reads were used in chromosome-level scaffolding.
The read pairs generated from the small-insert genomic DNA libraries were filtered out if the proportion of "N" sites exceeded 10%, number of low-quality bases exceeded 75 or the reads were polluted by adaptor sequences. Then, all clean Illumina reads were used to generate 17-mers with a window-sliding-like method. Accordingly, there were 4 17 different 17-mers. After calculating the depth distribution of these 17-mers using Jellyfish 11 (v2.1.3), we could estimate the genome size using Lander/Waterman's equations: www.nature.com/scientificdata www.nature.com/scientificdata/ In these equations, L is read length (150 for Illumina reads), N base and N 17-mer are counts of bases and 17-mers respectively; C base and C k-mer are expected coverage depths of bases and 17-mers, respectively; estimated genome size is represented by G est . As a result, the genome size of L. crocea was estimated to be approximately 708. 47 Mbp.
Long reads generated from the PacBio SEQUEL platform containing adaptor sequences or with a quality value lower than 20 (corresponding to a 1% error rate) were filtered out. The remaining reads were subsequently further processed by self-correction to address sequencing errors using Falcon 12 (version 1.8.2). Thereafter, genome assembly based on these error-corrected reads was processed in three stages: detection of overlaps among input reads and assemble the final string graph 13 using the Falcon pipeline; calling of highly accurate consensus sequences based on PacBio reads using quiver 14 (version 2.1.0); and polishing the preliminary contigs with Illumina reads using pilon 15 (version 1.21). Finally, we obtained a newly assembled genome of L. crocea containing 1,591 contigs with a total length of 723.86 Mb and a contig N50 length of 2.83 Mb (Table 2).
To obtain chromosome-level scaffolds, Hi-C reads were filtered in the same way as we filtered the short-insert library reads and subsequently mapped to de novo assembled contigs to construct contacts among the contigs using bwa 16 (version 0.7.17) with the default parameters. BAM files containing Hi-C linking messages were processed by another round of filtering, in which reads were removed if they were not mapped to the reference genome within 500 bp from the nearest restriction enzyme site. Then, LACHESIS 17 (version 2e27abb) was used for ultra-long-range scaffolding of de novo genome assemblies using the signal of genomic proximity provided by the Hi-C data. In this step, all parameters were set to defaults except that CLUSTER_N, CLUSTER_MIN_ RE_SITES and ORDER_MIN_N_RES_IN_SHREDS were set to 24, 80 and 10, respectively. The parameter CLUSTER_N was used to specify the number of chromosomes. For large yellow croaker, this number was determined to be 24 in previous studies 5,18,19 . Ultimately, we obtained 24 chromosome-level scaffolds constructed from 548 contigs with a total length of 668.67 Mb (92.48% of the total length of all contigs) ( Table 3).

Gene annotation.
To obtain a fully annotated L. crocea genome, three different approaches were employed to predict protein-coding genes. Ab intio gene prediction was performed on the repeat-masked L. crocea genome assembly using Augustus 20 (version 2.5.5), GlimmerHMM 21 (version 3.0.1), Geneid 22 (version 1.4.4) and GenScan 23 (version 1.0). Furthermore, homology-based prediction was performed using protein sequences of three common model species [Danio rerio (Dre) 24 , Homo sapiens (Hsa) 25 , and Mus musculus (Mmu) 26 ] downloaded from European Nucleotide Archive (ENA) and two related species [Oreochromis niloticus (Oni) 27 and www.nature.com/scientificdata www.nature.com/scientificdata/ Notothenia coriiceps (Nco) 28 ]. Subsequently, these protein sequences were mapped onto the generated assembly using blat 29 (version 35) with a cut off of e-value ≤ 1e −5 . GeneWise 30 (version 2.2.0) was employed to align the homologs in the L. crocea genome against the other species for gene structure prediction. In addition, we also applied transcriptome-based prediction by using existing RNA-seq data generated from various tissues including gonad 31 , spleen 32 , liver 33 , muscle 34 , skin 35 , brain 36 and embryos in different developmental stages 37 ( Table 4). The RNA-seq reads were mapped onto the genome assembly using TopHat 38   www.nature.com/scientificdata www.nature.com/scientificdata/ all transcribed genes were predicted by Cufflinks 39 (version 2.2.1) with the default parameters. The predicted gene sets generated from these three approaches were then integrated to produce a non-redundant gene set using EvidenceModeler 40 (version 1.1.0). PASA 41 (version 2.0.2) was then used to annotate the gene structures. As a result, a total of 23,172 protein-coding genes were predicted and subsequently annotated. The average number of exons per gene, and average CDS length were 9,27 and 1465.51 bp, respectively. To identify candidate non-coding RNA (ncRNA) genes, we aligned genome sequences against the Rfam database 42 (version 12.0) using BLASTN to search for homologs. As a result, a total of 7262 ncRNA genes were predicted (1246 miRNAs, 3517 tRNAs, 1758 rRNAs and 741 snRNAs, Fig. 2 and Table 5).
Gene function annotations were conducted against the NCBI nr and SwissProt protein databases, and homologs were called with E values of <1 × 10 −5 . The functional classification of Gene Ontology (GO) categories was performed using the InterProScan program 43 (version 5.26). Kyoto Encyclopedia of Genes and Genomes (KEGG) 44 pathway annotation analysis was performed using the KEGG Automatic Annotation Server (KAAS) 45 . As a result, a total of 23,323 genes could be annotated, accounting for 99.7% of all predicted genes (Fig. 2, and Table 2).  Table 6). A Perl script createRepeatLandscape. pl supplied with RepeatMasker was used to visualize the divergence distribution of TEs in the L. crocea genome (Fig. 3). The numbers and lengths of contigs comprising each chromosome were shown in the outermost track of a Circos 53 plot.

Data Records
This whole genome shotgun sequencing project has been deposited at DDBJ/ENA/GenBank under the accession RQIN00000000. The version described in this paper is version RQIN01000000 54 .  www.nature.com/scientificdata www.nature.com/scientificdata/   www.nature.com/scientificdata www.nature.com/scientificdata/ Genome assembly and annotation have also been deposited at Figshare 55 . All sequencing data, including the PacBio long reads, Illumina short reads and Hi-C reads, have been deposited in the NCBI Sequence Read Archive (SRA) under the accession numbers SRP169057 56 .
The existing RNA-seq datasets are all available in NCBI SRA, with the accession numbers listed in Table 4  Completeness and accuracy of the assembly. The completeness and accuracy of the assembly were further assessed in multiple ways. First, the reads from the short-insert library were re-mapped onto the assembly using bwa 16 59 . There were 232 genes out of the complete set of 233 genes (99.57%) covered by the assembly, suggesting the high completeness of the draft genome of L. crocea (Table 7). Subsequently, Benchmarking Universal Single-Copy Orthologs (BUSCO) software 60 (version 1.22) was executed using actinopterygii_odb9 database to assess the predicted gene set. The genome mode result showed that 97.1% of all 4584 BUSCOs were assembled, including 93.7% and 3.3% of all BUSCOs were completely and partially assembled, also implying a high level of completeness for the de novo assembly (Table 7). In addition, the results generated with protein mode based on all predicted genes showed that 91.2% of all 4584 BUSCOs were assembled, including 11.9% of all BUSCOs that were partially predicted (Table 7).

Code Availability
The execution of this work involved using many software tools. To allow readers to repeat any steps involved in genome assembly and genome annotation, the settings and parameters were provided below:  Table 6. Detailed classification of repeat sequences. Note: "De novo" represents the de novo identified transposable elements using RepeatMasker, RepeatModeler, RepeatScout, and LTR_FINDER. "TE proteins" indicates homologous transposable elements in Repbase identified with RepeatProteinMask, while "Combined TEs" refers to the combined results of transposable elements identified in these two ways. "Unknown" represents transposable elements that could not be classified by RepeatMasker.   Table 7. Details of accuracy and completeness validation of genome assembly.