Background & Summary

Oplegnathus fasciatus (commonly known as rock bream, barred knifejaw or striped beakfish) is a fish belonging to the family Oplegnathidae. Its common names are derived from its phenotypic features. Rock bream is a subtropical, carnivorous species and an economically important teleost fish in East Asia1. Generally, the rock bream inhabits estuaries at various depths according to its growth stage: juveniles are mostly found in drifting seaweed/algae, and adults are present at depths of 1 to 10 meters1. Moreover, the growth of the species depends on the photoperiod2. Other factors, such as overfishing and environmental changes, are affecting fish yield and cost, particularly in wild conditions. To overcome these issues, O. fasciatus is propagated via aquaculture to achieve sustainable and cost-effective production. In 2008, the annual production of O. fasciatus in South Korea was 614 tons, and that figure had increased to 909 tons by 20163. However, bacterial and viral diseases cause enormous economic losses in the Korean aquaculture industry3. As a consequence, the scientific community continues to seek various solutions, including molecular genetic applications, to overcome those problems. Some examples of these applications include genetic breeding4, QTL marker identification5, characterization of immunological pathway genes, proposed sex determination6, sex chromosomal evolution models6, antimicrobial peptides7,8, and vaccine development9.

Increasingly, advances in molecular sequencing technologies are helping the scientific community to uncover the inherited molecular mechanisms of a given species directly, rather than relying on a model organism10. In this study, we constructed a draft genome for O. fasciatus using next-generation sequencing (NGS) (Fig. 1), which could aid in the functional characterization of O. fasciatus-associated problems.

Figure 1: Illustration of the complete Oplegnathus fasciatus genome assembly and the structural and functional annotation pipelines used.

(a) the genome assembly pipeline, (b) the structural and functional annotation pipeline, (c) details of the reference gene sets used for the ab initio and evidence-based gene model predictions.

The O. fasciatus genome size is estimated to be ~749 Mb (Fig. 2a) and was assembled into scaffolds with a total size of 762 Mb. Initially, the 224 Gb Illumina library (Table 1) was assembled into 108,639 contigs and 31,533 scaffolds. Although the total length of the assembled scaffolds is larger than the estimated genome size, this initial assembly is highly fragmented (Table 2). Therefore, 11.5 Gb of PacBio sequences were included in a second assembly, which improved the quality of the overall draft genome compared to the initial assembly (Fig. 1a). This addition resulted in a 766 Mb draft genome with 4,149 scaffolds, along with improvements to the N50 (0.87 Mb to 1.1 Mb) and to the gaps (5.3% to 5.2%) (Table 2). Furthermore, the repeats were predicted by the de novo method and classified into subclasses (Table 3). In total, 180 Mb (23.56%) of the genomic regions consist of repeat sequences, which are masked in the genome.

Figure 2: Illustration of the genome size and the functional annotation of the Oplegnathus fasciatus genome.

(a) k-mer based genome size estimation, (b) sequence similarity-based species distribution obtained from BLAST.

Table 1 Summary of the complete sequence libraries used in this study.
Table 2 Oplegnathus fasciatus genome de novo assemblies.
Table 3 Repeat elements present in the Oplegnathus fasciatus genome.

A total of 334.3 Gb of mRNA transcriptome sequences from 34 libraries (313.8 Gb of Illumina data and 20.5 Gb of Iso-Seq data) were used for the EVM, and seven genomes were used in the ab initio gene modeler. These analyses predicted 23,338 genes and 24,053 transcripts, and 23,362 (97%) of those transcripts were annotated from biological databases. Moreover, the completeness score produced by CEGMA indicated that 220 (88.7%) eukaryotic core genes are completely mapped to the genome. These results indicate that the draft genome could serve as a near-complete reference genome for O. fasciatus. The scaffolds will act as a primary genetic resource for O. fasciatus that can be used to design functional studies, and the annotated transcripts (97%) will aid in detailed characterizations. Finally, based on a literature survey and to the best of our knowledge, this is the first publicly available draft genome from the family Oplegnathidae; therefore, these data could be a valuable asset for marine researchers.

Methods

Sample collection and genomic DNA extraction

A single rock bream fish (95 ± 5 g) was supplied by the Gyeongsangnam-do Fisheries Resources Research Institute (FRRI) (Tongyeong, Republic of Korea) and was maintained at 22 ± 0.5 °C in aerated seawater. Liver tissue was taken from the fresh rock bream aseptically and stored in liquid nitrogen for the extraction of the genomic DNA. The genomic DNA was extracted using a DNeasy Animal Mini Kit (Qiagen, Hilden, Germany). A total of 24 μg of DNA was quantified using the standard procedure for the Quant-iT PicoGreen ds-DNA Assay Kit (Molecular Probes, Eugene, OR, USA) with a Synergy HTX Multi-Mode Reader (Biotek, Winooski, VT, USA). The quality of the DNA was also checked using an ND-1000 spectrophotometer (Thermo Scientific, Wilmington, DE, USA).

DNA library preparation and sequencing

High-quality, high-molecular-weight genomic DNA (> 100 kb in length) was isolated from the given tissues, and two types of sequencing libraries, i.e., Illumina paired-end (PE) and mate pair (MP) libraries, were constructed according to the manufacturer's protocols (Illumina, San Diego, CA, USA). These libraries were fragmented and size-selected for Illumina Hi-Seq sequencing (Table 1). To obtain long, non-fragmented sequence reads, the manufacturer's protocols for PacBio were followed (Pacific Biosciences, CA, USA) with 14 cells, and the sequencing used the P6-C4 chemistry of the PacBio RS II system (Table 1).

Preprocessing and genome size estimation

All Illumina DNA sequences were subjected to pre-processing steps, which included adapter trimming, quality trimming (Q20) and contamination removal. The adapter and quality trimming were conducted using Trimmomatic-0.32 functions11, and the microbial contamination of each sample was removed by CLCMapper v4.2.0 (https://www.qiagenbioinformatics.com/products/clc-assembly-cell/) with an in-house database. Here, the in-house database was constructed from bacterial genomes (ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/prokaryotes.txt), viral genomes (ftp://ftp.ncbi.nlm.nih.gov/genomes/Viruses/) and marine metagenomes (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA13694). Similarly, mate pair sequences were also subjected to adapter and quality trimming, and classification of the mate pairs was performed using the Nextclip v1.1 method12. All the pre-processed sequences (insert size: 550 bp, 35 Gb) from the paired-end library (Data Citation 1) were subjected to genome size estimation using the k-mer based method (which was used in the panda genome13). The k-mer frequencies (k-mer size = 19) were obtained using the Jellyfish v2.0 method14, and the genome size was calculated from the following formulas: Genome Coverage Depth = (k-mer Coverage Depth × Average Read Length)/(Average Read Length − k-mer size + 1) and Genome Size = Total Base Number/Genome Coverage Depth. In contrast, the PacBio sequences were subjected only to error correction using CLCAssemblyCell v4.2.0 (Fig. 1a).
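The two formulas above can be written out as a short Python sketch. This is an illustration only, not code used in this study: the function names are hypothetical, and the k-mer coverage depth and average read length in the usage example are assumed values chosen for demonstration. Only the k-mer size (19) and the 35 Gb of input sequence come from the text.

```python
def genome_coverage_depth(kmer_depth: float, avg_read_len: float, k: int) -> float:
    """Convert a k-mer coverage depth into a base-level genome coverage depth."""
    if avg_read_len < k:
        raise ValueError("average read length must be at least the k-mer size")
    # Genome Coverage Depth =
    #   (k-mer Coverage Depth x Average Read Length) / (Average Read Length - k + 1)
    return kmer_depth * avg_read_len / (avg_read_len - k + 1)


def estimate_genome_size(total_bases: float, kmer_depth: float,
                         avg_read_len: float, k: int = 19) -> float:
    """Genome Size = Total Base Number / Genome Coverage Depth."""
    return total_bases / genome_coverage_depth(kmer_depth, avg_read_len, k)


if __name__ == "__main__":
    # Assumed, illustrative inputs: a k-mer depth peak of 38 and 101 bp reads.
    # Neither value is reported in the text.
    size = estimate_genome_size(total_bases=35e9, kmer_depth=38,
                                avg_read_len=101, k=19)
    print(f"Estimated genome size: {size / 1e6:.1f} Mb")
```

In practice the k-mer coverage depth would be read from the main peak of the k-mer frequency histogram (Fig. 2a).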

De novo genome assembly and scaffolding

The draft genome was built from two types of assemblies, i.e., short-read assemblies and hybrid assemblies. Initially, the complete pre-processed paired-end DNA sequences were assembled into contigs using CLCAssemblyCell v4.2.0. The contigs were then scaffolded with the mate-pair sequences using the SSPACE v3.0 method15, and the hybrid assembly was built from these scaffolds and the processed PacBio sequences using the SSPACE-LongRead v1.0 method16. Next, the hybrid scaffolds were subjected to gap filling with the paired-end and mate pair libraries using the GapFiller 1.11 method17. Finally, the gene completeness was assessed using CEGMA18 (Fig. 1a).
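Assembly contiguity in this study is summarized by the N50 (Table 2). As a reminder of how that statistic is defined, the following minimal Python sketch computes the N50 from a list of scaffold lengths. It is not the code used by the assembly tools; the function name and the toy lengths are illustrative assumptions.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50: the length L such that sequences of length >= L
    together contain at least half of the total assembled bases."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one sequence length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first sequence that brings the cumulative sum to half.
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


if __name__ == "__main__":
    toy_scaffold_lengths = [2, 2, 2, 3, 3, 4, 8, 8]  # hypothetical lengths
    print(n50(toy_scaffold_lengths))
```

A larger N50 with fewer scaffolds, as seen after adding the PacBio reads, indicates a less fragmented assembly.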

De novo repeat region prediction and classification

Initially, repeat regions were predicted using the de novo method and classified into repeat subclasses (Table 3). The de novo repeat prediction for O. fasciatus was conducted using RepeatModeler (http://www.repeatmasker.org/RepeatModeler/), which includes other methods such as RECON19 (http://eddylab.org/software/recon/), RepeatScout20 (https://bix.ucsd.edu/repeatscout/) and TRF21 (https://tandem.bu.edu/trf/trf.html). Furthermore, the repeats were masked using RepeatMasker v4.0.5 (http://www.repeatmasker.org/) with RMBlastn v2.2.27+ and classified into their subclasses using the Repbase22 v20.08 databases for reference (https://www.girinst.org/repbase/).
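The proportion of the genome covered by repeats can be checked directly from a masked assembly. The sketch below is a hedged illustration rather than part of the study's pipeline: it assumes soft-masking (repeats written in lowercase) or hard-masking (repeats replaced by N or X), neither of which is stated in the text, and the function name is hypothetical.

```python
from typing import Iterable


def masked_fraction(sequences: Iterable[str]) -> float:
    """Fraction of bases that are masked across all sequences.

    Assumption: lowercase letters mark soft-masked repeats, and N/X mark
    hard-masked repeats. Note that assembly gaps are also written as N, so
    this count overestimates repeats in a hard-masked, gapped assembly.
    """
    total = 0
    masked = 0
    for seq in sequences:
        total += len(seq)
        masked += sum(1 for base in seq if base.islower() or base in "NX")
    if total == 0:
        raise ValueError("no sequence data provided")
    return masked / total


if __name__ == "__main__":
    toy_scaffolds = ["ACGTacgt", "NNAC"]  # hypothetical masked sequences
    print(f"{masked_fraction(toy_scaffolds):.2%} of bases are masked")
```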

Gene prediction and annotation

The genes from the O. fasciatus draft genome were predicted using an in-house gene prediction pipeline, which includes three modules: an evidence-based gene modeler (EVM), an ab initio gene modeler and a consensus gene modeler. Functional annotation was then conducted for the consensus genes (Fig. 1b). The details of this pipeline were previously explained in articles on the genomes of Capsicum23 and Haliotis24. Initially, the sequenced transcriptomes from two sequencers (Illumina (313.8 Gb) and IsoSeq (27.7 Gb)) were mapped to the O. fasciatus repeat-masked draft genome using TopHat25, and the transcript/gene structural boundaries were predicted using Cufflinks25 and PASA26. Several genomes (Gasterosteus aculeatus, Oreochromis niloticus, Tetraodon nigroviridis, Takifugu rubripes, Oryzias latipes, Danio rerio, and Homo sapiens) were used to train the ab initio gene modeler and the EVM (which include Exonerate27, AUGUSTUS28, and GENEID29). Next, the predicted gene and transcript models from the EVM and the ab initio modeler were subjected to the consensus gene modeler (which includes EVidenceModeler30) to produce the final gene and transcript models. Finally, the consensus transcripts were functionally annotated against biological databases (NCBI NR, UniProt, Gene Ontology and KEGG pathways) using Blast2GO31 (Fig. 1b). From this annotation, 50% of the genes are highly similar to those of Larimichthys crocea (Fig. 2b).

Code availability

No custom code was used in this study. The command lines at each step were executed as instructed for the respective bioinformatics methods.

Data Records

The entire data set used for the draft assembly and its corresponding functional and structural annotations were deposited in public repositories. The DNA sequence libraries were deposited in NCBI (Data Citation 1; see Table 1 for details). The final assembly super-scaffolds were submitted to NCBI Assembly (Data Citation 2; see Table 2 for details). Moreover, the other files, such as the assembled contigs, scaffolds, and annotation tables, were stored in figshare (Data Citation 3; see Table 4 for details).

Table 4 Datasets for this project submitted to the figshare repository and their descriptions.

Technical Validation

Throughout this study, every step was validated with the given metrics. The sampled fish were cultured under controlled conditions in the FRRI. Furthermore, the DNA used for the sequence libraries was checked against platform-specific thresholds. For Illumina, the required spectrophotometer ratio (SP) was 260/280 ≥ 1.6, with total DNA ≥ 1.1 μg at a minimum concentration of 20 ng/μl; for PacBio, the required SPs were 260/280 ≥ 1.6 and 260/230 ≥ 2.0, with total DNA ≥ 15 μg at a minimum concentration of 200 ng/μl. Moreover, the default parameters were used in the bioinformatics methods.
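The DNA quality thresholds listed above can be expressed as a small check. The following Python sketch is illustrative and was not part of this study's workflow: the type and function names are hypothetical, the sample values in the usage example are invented for demonstration, and only the threshold values are taken from the text.

```python
from typing import NamedTuple, Optional


class DnaSample(NamedTuple):
    a260_280: float                  # spectrophotometer 260/280 ratio
    total_ug: float                  # total DNA in micrograms
    conc_ng_per_ul: float            # concentration in ng/ul
    a260_230: Optional[float] = None # 260/230 ratio, if measured


class QcThreshold(NamedTuple):
    min_a260_280: float
    min_total_ug: float
    min_conc_ng_per_ul: float
    min_a260_230: Optional[float] = None  # None means "not required"


# Threshold values as stated in the Technical Validation text.
ILLUMINA = QcThreshold(min_a260_280=1.6, min_total_ug=1.1, min_conc_ng_per_ul=20.0)
PACBIO = QcThreshold(min_a260_280=1.6, min_total_ug=15.0,
                     min_conc_ng_per_ul=200.0, min_a260_230=2.0)


def passes_qc(sample: DnaSample, threshold: QcThreshold) -> bool:
    """Return True if the sample meets every minimum in the threshold."""
    if sample.a260_280 < threshold.min_a260_280:
        return False
    if sample.total_ug < threshold.min_total_ug:
        return False
    if sample.conc_ng_per_ul < threshold.min_conc_ng_per_ul:
        return False
    if threshold.min_a260_230 is not None:
        # A required ratio that was never measured counts as a failure.
        if sample.a260_230 is None or sample.a260_230 < threshold.min_a260_230:
            return False
    return True


if __name__ == "__main__":
    sample = DnaSample(a260_280=1.8, total_ug=24.0,
                       conc_ng_per_ul=250.0, a260_230=2.1)  # invented values
    print("Illumina:", passes_qc(sample, ILLUMINA))
    print("PacBio:", passes_qc(sample, PACBIO))
```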

Additional information

How to cite this article: Shin, Y. et al. First draft genome sequence of the rock bream in the family Oplegnathidae. Sci. Data. 5:180234 doi: 10.1038/sdata.2018.234 (2018).

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.