Draft genome sequences of two oriental melons, Cucumis melo L. var. makuwa

Oriental melon (Cucumis melo L. var. makuwa) is one of the most important cultivated cucurbits, and is grown widely in Northeast Asian countries. With increasing interest in its biological properties and economic importance, oriental melon has become an attractive model crop for studying various horticultural traits. A previous genome sequence of the melon was constructed from a homozygous double-haploid line. Thus, individual reference genomes are required to perform functional studies and further breeding applications. Here, we report draft genome sequences of two oriental melons, Chang Bougi and SW3. The assembled 344 Mb genome of Chang Bougi was obtained with scaffold N50 1.0 Mb, and 36,235 genes were annotated. The 354 Mb genome of SW3 was assembled with scaffold N50 1.6 Mb, and has 38,173 genes. These newly constructed genomes will enable studies of fruit development, disease resistance, and breeding applications in the oriental melon.

Chang Bougi 9 comprised 11,309 scaffolds totaling 344 Mb in length, with a scaffold N50 of 1.0 Mb. For SW3, 7,202 scaffolds totaling 354 Mb in length were assembled 10, with a scaffold N50 of 1.6 Mb (Table 2). Repeat annotation was then carried out (Table 3). K-mer frequencies were calculated to provide information on low-frequency k-mers, sequencing depth, level of heterozygosity, and genome size (Fig. 2) 11. The estimated genome sizes of Chang Bougi and SW3 were 355 Mb and 373 Mb, respectively, which were similar to previously reported genome sizes 5. A total of 36,235 and 38,173 genes were retained as final gene models in Chang Bougi and SW3, respectively (Table 2 and Fig. 3). Functional annotation of the final gene models was then performed (Table 4 and Fig. 4). Finally, we provide new reference genomes of oriental melon for further analysis and breeding programs.
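As an illustration of the scaffold N50 statistic quoted above, the following minimal Python sketch computes N50 from a list of scaffold lengths and compares the assembled lengths with the k-mer-based genome size estimates. The function name `n50` and the toy scaffold lengths are hypothetical and are not taken from this study; only the Mb values come from the text.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths (bp).

    N50 is the length L such that scaffolds of length >= L together
    contain at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("no scaffold lengths given")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy scaffold lengths, invented for illustration (not data from this study).
toy_scaffolds = [8, 8, 4, 3, 3, 2, 2, 2]
toy_n50 = n50(toy_scaffolds)

# Assembled length divided by the k-mer-based estimate, using the
# sizes reported in the text (both in Mb).
assembled_fraction = {
    "Chang Bougi": 344 / 355,
    "SW3": 354 / 373,
}
```

With the reported sizes, the assemblies cover roughly 97% (Chang Bougi) and 95% (SW3) of the estimated genome sizes.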

Methods
DNA extraction and sequencing. Leaves of two oriental melons were harvested and frozen immediately in liquid nitrogen. Genomic DNA was extracted, and paired-end and mate-pair libraries for next-generation sequencing were constructed according to the manufacturer's instructions (Illumina, San Diego, CA, USA). The quality of each library was validated using the KAPA SYBR FAST Universal 2× qPCR Master Mix (Kapa Biosystems, Boston, MA, USA). Each library was sequenced with the Illumina HiSeq 2500 platform.
Genome assembly. Pre-processing of raw sequences with an in-house pipeline and genome assembly were performed as described in previous studies 4,12. After pre-processing to remove erroneous sequences from the raw data, the remaining sequences in the paired-end libraries were assembled using Platanus 13, with parameters for Chang Bougi (-k 63 -c 5 -d 0.3 -t 40 -m 220) and for SW3 (-k 91 -c 5 -d 0.3 -t 44 -m 200). The scaffolding process was performed with Platanus, using paired-end and mate-pair sequences, with parameters for Chang Bougi (-l 3 -s 61 -u 0.2 -t 40) and for SW3 (-l 3 -u 0.2 -t 15). Remaining gaps were filled with Platanus and GapCloser version 1.10 (http://soap.genomics.org.cn/down/GapCloser_release_2011.tar.gz), using reads from the paired-end and mate-pair libraries.
Genome annotation. Annotation of the two genomes was performed using the KOBIC annotation pipeline (a modified PGA pipeline 14), consisting of repeat masking, mapping of different protein sequence sets, and ab initio prediction performed by AUGUSTUS v3.2.2 15. Transcript assembly was performed against the assembled genome by a reference-based algorithm using HISAT2 16 and StringTie 17. To generate protein-based gene models for consensus modeling, the protein sequences of Arabidopsis thaliana (TAIR10, http://www.arabidopsis.org), Citrullus lanatus 18, Cucumis melo 5, and Cucumis sativus 19 were mapped using GeneWise v2.1 20. AUGUSTUS was used for gene prediction in the two oriental melon genomes. To validate the predicted gene models, protein sequences from the genomes of C. lanatus, C. melo, C. sativus, and A. thaliana were used as queries in BLASTp, and erroneous gene models were filtered with a BLASTp cut-off of query coverage ≥0.3. The assembled transcripts were also validated against the same four sets of protein sequences using tBLASTn, and filtered with cut-off values of query coverage ≥0.5 and subject coverage ≥0.3.
The GeneWise gene models that remained were converted from GeneWise format to GFF3 format, and used to determine the consensus gene model via EVM 21, which combines ab initio gene predictions with protein and transcript alignments into weighted-consensus gene structures (ab initio predictions = 1, protein alignment = 5, transcript alignment assemblies = 7). Ultimately, the final gene models included a total of 36,235 consensus genes for Chang Bougi and 38,173 consensus genes for SW3 (Table 2 and Fig. 3).
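The coverage cut-offs used in the annotation step (query coverage ≥0.3 for BLASTp; query coverage ≥0.5 and subject coverage ≥0.3 for tBLASTn) can be illustrated with a short Python sketch. This is not the pipeline's code: the `Hit` record, the function names, and the simplification of computing coverage from a single alignment span (rather than merging several HSPs) are all assumptions made for illustration.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One alignment between a query and a subject (hypothetical record)."""
    query_id: str
    q_start: int  # 1-based, inclusive
    q_end: int
    q_len: int
    s_start: int  # may be larger than s_end for reverse-strand hits
    s_end: int
    s_len: int


def coverage(start, end, length):
    """Fraction of a sequence spanned by the alignment coordinates."""
    low, high = sorted((start, end))
    return (high - low + 1) / length


def passes(hit, min_query_cov, min_subject_cov=0.0):
    """True if the hit meets both coverage thresholds."""
    query_cov = coverage(hit.q_start, hit.q_end, hit.q_len)
    subject_cov = coverage(hit.s_start, hit.s_end, hit.s_len)
    return query_cov >= min_query_cov and subject_cov >= min_subject_cov


# Invented example hits (not data from this study).
protein_hit_ok = Hit("g1", 1, 40, 100, 1, 100, 300)        # query coverage 0.40
protein_hit_bad = Hit("g2", 1, 20, 100, 1, 100, 300)       # query coverage 0.20
transcript_hit = Hit("p1", 11, 70, 100, 300, 91, 600)      # 0.60 and 0.35

kept_blastp = [h for h in (protein_hit_ok, protein_hit_bad) if passes(h, 0.3)]
kept_tblastn = [h for h in (transcript_hit,) if passes(h, 0.5, 0.3)]
```

In practice the coordinates and lengths would come from tabular BLAST output; how multiple alignments to the same subject are combined is a design choice that this sketch does not address.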

Fig. 3 Comparisons of gene models for two oriental melon genomes and other genomes. (a) Gene length distribution (b) CDS length distribution (c) Exon number distribution (d) Intron length distribution (e) Intron number distribution. The x-axis shows the length (bp) of genes (a), CDS (b), and introns (d), or the number of exons (c) and introns (e). The y-axis shows the proportion of genes.

Functional annotation of the final gene models (Table 4 and Fig. 4) predicted 2,093, 3,703, and 493 genes as hypothetical protein, uncharacterized protein, and unknown function, respectively, in the Chang Bougi genome. In the SW3 genome, 2,245, 3,827, and 570 genes were predicted as hypothetical protein, uncharacterized protein, and unknown function, respectively.

Data records
All of the raw sequence reads produced by Illumina HiSeq 2500 have been deposited at the NCBI Sequence Read Archive (SRA) under BioProject number PRJNA531526 (accession SRP191487) 8 and BioSample accessions SAMN11368505 to SAMN11368524 (SAMN11368505 to SAMN11368515 for Chang Bougi; SAMN11368516 to SAMN11368524 for SW3). The Whole Genome Shotgun project of Chang Bougi has been deposited at DDBJ/ENA/GenBank under the accession number SSTD00000000 9 under BioProject number PRJNA531576 and BioSample SAMN11370205. The Whole Genome Shotgun project of SW3 has been deposited at DDBJ/ENA/GenBank under the accession number SSTE00000000 10 under BioProject number PRJNA531478 and BioSample SAMN11381272.

Technical Validation
Detection and filtration of misannotated genes. EvidenceModeler predicted 39,977 and 42,535 consensus genes for Chang Bougi and SW3, respectively. We investigated these to detect misannotated genes, as recommended by NCBI GenBank, including genes containing internal stop codons, genes lacking a stop codon, frame-shifted genes, and genes with erroneous start codons. A total of 3,742 and 4,362 misannotated genes were detected and masked in Chang Bougi and SW3, respectively. Thus, 36,235 genes remained in the Chang Bougi genome, and 38,173 genes remained in SW3.
Evaluation of genome annotation using BUSCO. BUSCO v3.0.2 25 assesses the completeness of an assembled genome based on orthologous groups of single-copy genes from OrthoDB (http://www.orthodb.org), using hidden Markov models to profile amino acid alignments. For genome annotation assessments, we used 1,440 gene sets of orthologs conserved in embryophyta (Table 5). The results showed that most of these core orthologs were present in the genomes of Chang Bougi (85.28%) and SW3 (86.81%).
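The kinds of misannotation listed in this section (internal stop codons, missing stop codons, frameshifts, and erroneous start codons) can be illustrated with a minimal Python sketch. This is not the procedure used by the authors or by NCBI: the function `cds_problems`, the use of a length check as a proxy for frameshifts, and the assumption that ATG is the only valid start codon are illustrative choices. The last lines check that the gene counts reported in the text are mutually consistent.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def cds_problems(cds, start_codons=("ATG",)):
    """Return a list of problems found in a coding sequence (DNA, 5'->3')."""
    cds = cds.upper()
    problems = []
    remainder = len(cds) % 3
    if remainder != 0:
        # Assumption: a CDS whose length is not a multiple of 3 is treated
        # as a possible frameshift.
        problems.append("length not a multiple of 3")
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - remainder, 3)]
    if not codons or codons[0] not in start_codons:
        problems.append("erroneous start codon")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("missing stop codon")
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("internal stop codon")
    return problems


# Consistency check on the counts reported in the text:
# predicted consensus genes minus masked genes equals final genes.
final_chang_bougi = 39_977 - 3_742
final_sw3 = 42_535 - 4_362
```

For example, `cds_problems("ATGAAATAA")` returns an empty list, whereas a sequence with a premature TAA in frame is reported as having an internal stop codon.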
Comparison of gene sets in the genomes of oriental melons Chang Bougi and SW3 with those in the genomes of melon (DHL92 v3.6.1) and cucumber. To compare gene sets between oriental melons and previously reported cucurbit genomes, orthologous and paralogous genes were detected in the melon genome (DHL92 v3.6.1), Chang Bougi, SW3, and cucumber (C. sativus) using the program OrthoFinder 26. A total of 113,006 sequences were clustered into 30,738 groups, with 3,475 and 4,469 singleton genes detected in Chang Bougi and SW3, respectively (Fig. 5). Fewer singleton genes might be expected in the two oriental melons than in the melon genome, which was constructed from the homozygous double-haploid line DHL92, derived from a cross between a Korean landrace of oriental melon (Songwhan Chamoe, PI 161375) and melon (Piel de Sapo). In addition, 2,213 genes were common to melon and the two oriental melons, and 12,983 genes were detected in all four genomes. Functional investigation of the singleton genes of Chang Bougi and SW3 indicated that 869 and 1,112 genes, respectively, had no known function.
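The counting behind such a comparison can be sketched in Python. The sketch assumes that a singleton is a gene that clusters with no other gene, and that each group is available as a list of (species, gene) pairs; both the data layout and the function names are hypothetical and do not reproduce OrthoFinder's output format. The toy groups are invented for illustration.

```python
from collections import Counter


def singleton_counts(orthogroups):
    """Count, per species, the genes that form a group on their own."""
    counts = Counter()
    for members in orthogroups:
        if len(members) == 1:
            species, _gene = members[0]
            counts[species] += 1
    return counts


def groups_shared_by(orthogroups, species):
    """Count groups that contain at least one gene from every given species."""
    wanted = set(species)
    return sum(
        1 for members in orthogroups
        if wanted <= {sp for sp, _gene in members}
    )


# Invented toy groups (not data from this study).
toy_groups = [
    [("ChangBougi", "cb1"), ("SW3", "sw1"), ("DHL92", "m1"), ("cucumber", "c1")],
    [("ChangBougi", "cb2"), ("SW3", "sw2"), ("DHL92", "m2")],
    [("ChangBougi", "cb3")],
    [("SW3", "sw3")],
    [("SW3", "sw4")],
]

toy_singletons = singleton_counts(toy_groups)
toy_all_four = groups_shared_by(
    toy_groups, ["ChangBougi", "SW3", "DHL92", "cucumber"]
)
toy_three_melons = groups_shared_by(toy_groups, ["ChangBougi", "SW3", "DHL92"])
```

Note that `groups_shared_by` counts groups, not genes, and counts every group containing the listed species regardless of whether other species are also present; a Venn-style tally would additionally need to exclude the remaining species.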

Code availability
The sequence data were generated using software provided by the sequencing platform manufacturer, and were processed with publicly available software and recommended settings, as cited in this report. No custom computer code was generated in this work.