De novo assembly and characterization of the first draft genome of quince (Cydonia oblonga Mill.)

Quince (Cydonia oblonga Mill.) is the sole member of the genus Cydonia in the Rosacea family and closely related to the major pome fruits, apple (Malus domestica Borkh.) and pear (Pyrus communis L.). In the present work, whole genome shotgun paired-end sequencing was employed in order to assemble the first draft genome of quince. A genome assembly that spans 488.4 Mb of sequence corresponding to 71.2% of the estimated genome size (686 Mb) was produced in the study. Gene predictions via ab initio and homology-based sequence annotation strategies resulted in the identification of 25,428 and 30,684 unique putative protein coding genes, respectively. 97.4 and 95.6% of putative homologs of Arabidopsis and rice transcription factors were identified in the ab initio predicted genic sequences. Different machine learning algorithms were tested for classifying pre-miRNA (precursor microRNA) coding sequences, identifying Support Vector Machine (SVM) as the best performing classifier. SVM classification predicted 600 putative pre-miRNA coding loci. Repetitive DNA content of the assembly was also characterized. The first draft assembly of the quince genome produced in this work would constitute a foundation for functional genomic research in quince toward dissecting the genetic basis of important traits and performing genomics-assisted breeding.

germplasm characterization with transferable molecular genetic tools, namely SSR (simple sequence repeat) markers developed for apple and/or pear genomes 8,[12][13][14] . In other work, generic, random marker systems including ISSR (inter simple sequence repeat) 4 , RAPD (random amplified polymorphic DNA) 15 and AFLP (amplified fragment length polymorphism) 16 were utilized for the molecular genetic characterization of quince collections. Genome assemblies provide the foundation for further extensive molecular genetic research in agriculturally relevant species. In the present work, a draft genome assembly of quince was produced for the first time using whole genome shotgun paired-end sequencing. Assembly characterization was performed by ab initio and homology-based gene predictions, as well as microRNA coding loci identification employing machine-learned classification. Repetitive portion of the assembly was characterized by the analyses of transposable element content and microsatellite composition.

Materials and methods
DNA isolation, sequencing and sequence pre-processing. DNA was extracted from leaf tissue of C. oblonga rootstock clone 'Quince A' using a modified CTAB protocol 17 as follows: 200 mg liquid nitrogen frozen, ground leaf tissue was mixed with 800 μL of CTAB extraction buffer [100 mM Tris-HCl (pH 8.0), 20 mM EDTA (pH 8.0), 1.4 M NaCl, 2% (w/v) CTAB, 1% PVP] and 5 μL of RNase A (0.01 mg/μL), and incubated at 65 °C for 1 h. Lysed sample was mixed with 600 μL of chloroform:isoamyl alcohol (24:1) and centrifuged at 20,900×g for 10 min. The supernatant phase was incubated with 200 μL of isopropanol at 4 °C for 1 h for DNA precipitation. DNA pellet was collected by centrifugation at 4 °C for 10 min at 20,900×g, washed with 100 μL of 70% ethanol and re-suspended in 100 μL of Tris-EDTA buffer (pH 8.0).
Paired-end sequencing was provided by Macrogen NGS Service (Macrogen Inc., Korea) using an Illumina NovaSeq 6000 platform. A paired-end sequencing library of median insert size of 450 bp was prepared using a TruSeq DNA PCR-Free kit according to TruSeq DNA PCR-Free Sample Preparation Guide prior to sequencing. Reads of 151 bp length were obtained as a result of paired-end sequencing. Data were filtered for reads that pass the Q30 score (92.77% of the total reads) and barcode adapters were trimmed using FASTX-Toolkit version 0.0.13 (http://hanno nlab.cshl.edu/fastx _toolk it/index .html).
Sequence assembly and quality evaluation. Short reads were assembled using SOAPdenovo2 version 2.04 18 with a k-mer length of 127 and filtered for a minimum read length of 500 bp. Assembly completeness based on gene content was assessed using the BUSCO (Benchmarking Universal Single-Copy Orthologs) pipeline 19 . The lineage dataset used was eudicots_odb10 (Creation date: 2019-11-20, number of species: 31, number of BUSCOs: 2326). Contigs that represent the plastid genome were aligned to the C. oblonga plastid genome assembly (GenBank Accession: NC_045415.1).
Gene prediction and functional annotation. Gene prediction was performed using both ab initio and homology-based strategies. Ab initio gene prediction was performed using the AUGUSTUS gene prediction server 20 via the OmicsBox version 1.3 platform (https ://www.bioba m.com/omics box/). Assembled C. oblonga genome sequences were used as the input for ab initio annotation process. Repeat Masking application via RepeatMasker Open-4.0 21 under the Genome Analysis module of OmicsBox was run prior to gene prediction. Functional annotation was performed using the output file of the ab initio gene prediction process. Accordingly, the UniProt collection of apple (Malus domestica Borkh.) protein sequences (45,359 sequences) (https ://www. unipr ot.org/unipr ot/?query =malus +domes tica&sort=score ; Access date: September 2020) and plant transcription factor database (PlnTFDB version 3.0) 22 were used to create local databases for running blastp searches. The E-value threshold used was 1E−10. GO mapping was performed on the output data and GO terms were assigned to the predicted genes with the subsequent GO annotation process. Homology-based gene prediction was also performed with an E-value threshold of 1E−10 based on M. domestica proteins using the genome assembly file as the input and the UniProt collection of M. domestica protein sequences as the local database. miRNA coding loci identification. Ab initio pre-miRNA (precursor microRNA) detection employing a machine learning algorithm and homology-based pre-miRNA search were used in combination to predict pre-miRNA coding loci in C. oblonga genome assembly. Toward this aim, 7579 plant pre-miRNA sequences available on miRbase (Release 22.1) 23 were downloaded and used for training and testing classifiers. Alongside the positive pre-miRNA dataset, a negative training and testing sample set was used that consisted of plant protein coding sequences. Pre-miRNAs and the negative sequence dataset were parameterized using k-mer (k = 4) method 24 . Frequency of each k-mer in the input sequence was detected and used to create a matrix as the parameterized version of the biological sequences. k-mers with unambiguous base calls (N, R, Y, S, W, etc.) were filtered out from the data. Datasets were split into training (70%) and testing data (30%) for training classifiers and evaluating their performance. Random forest (RF), Support Vector Machine (SVM), Naïve Bayes (NB), XGBoost, K-Nearest Neighbors (KNN) classifiers were trained and tested for their performance using the metrics of precision, accuracy, recall (sensitivity) and F-measure according to the following Eqs. www.nature.com/scientificreports/ SVM was the best performing machine-learned model for pre-miRNA prediction and used for the de novo identification of putative miRNA coding loci in the assembly. The trained SVM classifier was run on Scikit-learn 0.23.2 library 26 in order to scan the assembly with a predefined sliding window of 70 nucleotides. The output of SVM classification was used for the homology search with plant pre-miRNA sequences deposited in miRBase (7579 sequences). The E-value threshold used for the homology search was 1E−10 with a word size of 7.

Characterization of transposable elements (TEs) and microsatellite composition. Domain
Based Annotation of Transposable Elements (DANTE) was performed via the RepeatExplorer server 27 (Access date: September 2020) for the identification and characterization of the transposable element content by domain search and subsequent phylogenetic annotation and classification. Microsatellite composition was analyzed using GMATA version 2.01 (Genome-wide Microsatellite Analyzing Tool Package) 28 . Mononucleotide repeat loci of minimum 10 bases and 2-6 nucleotide microsatellite motifs of min. 5 repeats were mined in the assembly.

Results and discussion
Genome assembly and quality evaluation. Paired-end sequencing produced 494.5 million reads of 151 bp length. The total size of the raw sequence reads was 74.7 Gb. Sequence Read Archive (SRA) files are deposited at GenBank under the BioProject PRJNA675337. Relative GC and AT content of the sequence reads was 38.7 and 61.3%, respectively. Ratio of bases with phred quality score over 20 (Q20) was 97.1% and the ratio with quality score over 30 (Q30) was 92.8% (Table 1). Assembly of the barcode-trimmed reads that pass the Q30 filter produced 303,932 contigs larger than 500 bp. The total size of the assembly was 488.4 Mb, corresponding to 71.2% of the C. oblonga genome size (686 Mb) estimated by flow cytometry 29 . N50 value of the assembly was calculated as 2.4 kb and the maximum contig size was 53.8 kb (Table 1). Assembly files are deposited at GenBank under the accession JADOBS000000000. Searching BUSCO sets enables the quantitative assessment of genome completeness using an evolutionary measure, informed expectation of orthologous gene content 19 . Assessing the assembly completeness via BUSCO identified 1312 BUSCOs (81.2%) out of 1614 BUSCO groups of the Embryophyta database. 1024 of the identified BUSCOs were complete BUSCOs, 869 of which were complete single-copy and 155 were complete duplicated orthologs. 288 out of 1312 BUSCOs were fragmented. The number of missing BUSCOs in the genome assembly was 302 (18.8%). Complete list of BUSCO IDs, identification status of each ID, gene locations in the assembly and gene descriptions are provided as Supplementary Table S1.  Table S2). Direct homology-based annotation was also performed in addition to ab initio gene prediction, and identified 30,684 putative genes based on homology with M. domestica proteins (Supplementary Table S3). As a result of GO (gene ontology) mapping and www.nature.com/scientificreports/  www.nature.com/scientificreports/ annotation of ab initio predicted genic sequences, a total of 192,631 GO annotations were obtained for the three categories, biological process (BP), cellular component (CC) and molecular function (MF) at a mean level of 6.2 (Fig. 1a). The highest number of GO term assignments was obtained for 'cellular process' in the BP category (Fig. 1b). 'cellular and anatomical entity' was the top GO term for the CC category and 'binding' was the top molecular function according to the number of identified GO terms in the MF category (Fig. 1c,d). Sub categorization of the top GO terms at level 3 is provided as Supplementary Fig. S1.  www.nature.com/scientificreports/ Transcription factors (TFs) have pivotal functions in almost all plant biological processes from seed germination to stress responses, therefore, play important roles in plant evolution and domestication 30 . A specific search for transcription factors in the ab initio predicted genic sequences was performed using the plant transcription factor database (PlnTFDB) 22 . The results were filtered for two species, Arabidopsis thaliana L. and Oryza sativa L. ssp. japonica. Arabidopsis and rice were specifically selected for defining the transcription factor composition in the assembly since they are the basic model species for eudicot and monocot genomics with well-annotated genomes and parallel expansion of TF orthologous groups in the two species has already been demonstrated 31 . PlnTFDB search identified 2686 of the 2757 annotated Arabidopsis TFs and 2981 out of 3119 annotated rice TFs in the assembly. Thus, 97.4 and 95.6% of the putative homologs of Arabidopsis and rice TFs were found to be present in the assembly. Putative homologs of the three largest TF families, bHLH (basic helix-loop-helix), MYB and AP2-EREBP (APETALA2-ethylene responsive element binding protein) were almost completely identified in the assembly with 175/177 bHLH, 166/166 MYB and 166/168 AP2-EREBP homologies (Fig. 2). The complete list of identified putative TF homologs is provided as Supplementary Table S4.
Prediction of miRNA coding loci. MicroRNAs regulate gene expression post-transcriptionally, playing crucial roles in plant development and stress responses. Therefore, during the last two decades, much effort has been devoted to miRNA identification 32 . The conventional miRNA identification route that relies on RNA sequencing can be biased toward abundant transcripts in addition to the dependence on tissue and developmental stage 33,34 . Thus, it is challenging to detect miRNAs that are constitutively expressed at low levels or expressed at specific tissues at very narrow time intervals 33 . Ab initio/de novo miRNA detection using Table 2. Results of miRNA prediction with the combined approach of SVM classification and homologybased identification. a pre-miRNA records from miRBase. mdm: Malus domestica; pla: Paeonia lactiflora; vvi: Vitis vinifera; gma: Glycine max. b Trained SVM classifier probability value. c E-value threshold applied for the BLAST analysis. www.nature.com/scientificreports/ genome scale sequences circumvents these problems and has proved successful in identifying pre-miRNAs in plant genomes 24,33,35,36 . As a result of related work, Support Vector Machine (SVM) has been reported as an effective machine-learned classifier to locate pre-miRNA coding loci in genome scale sequences 33,35,37 . In the present work, five different machine-learning algorithms (SVM, RF, KNN, XGBoost and NB) were trained and tested for their performance in predicting pre-miRNA coding genomic sequences. SVM was the best performing algorithm according to the applied performance metrics of accuracy, and F-measure (includes precision and recall measures) (Fig. 3a). The ROC (receiver operating characteristic) curve for the trained SVM classifier is provided as Fig. 3b. According to the results of the classifier performance test, the C. oblonga genome assembly was scanned with the trained SVM classifier, resulting in the identification of 600 putative pre-miRNA coding loci (Supplementary Table S5). The subsequent homology search using the SVM predicted loci as the query and the miRBase record of plant pre-miRNAs as the database, identified 33 matches including 28 pre-miRNAs from M. domestica, two from Glycine max, two from Vitis vinifera and 1 pre-miRNA from Paeonia lactiflora ( Table 2).
Transposable element and microsatellite composition of the C. oblonga genome. TEs actively contribute to eukaryotic genome evolution 38 and often modify peripheral gene expression via altered epigenetic marks. Such transposon induced gene expression patterns may directly be involved in the natural and/or artificial selection of certain advantageous genotypes in plant species. For example, red skin color in apple is subject to both natural and artificial selection and, a recent study associates the trait with a LTR (long terminal repeat) retrotransposon insertion upstream of the MdMYB1 gene, which is a core transcriptional regulator of the anthocyanin biosynthesis pathway 11 . In the present work, the TE composition of the C. oblonga genome assembly was characterized using a domain-based approach which includes domain search followed by phylogenetic TE annotation and classification. As a result, 104,710 TEs were identified including 96,153 Class I and 8444 Class II elements (Table 3). Thus, the vast majority of TEs were retrotransposons, constituting 91.8% of the total number of TEs identified in the C. oblonga genome assembly. Most of the Class I elements were classified either as LTR/ Ty3/Gypsy or LTR/Ty1/Copia elements (Fig. 4a) and constituted 93.6% of the total retrotransposon content. LTR superfamily transposons are the predominant TE type in many plant genomes 9-11,39-44 and according to our TE composition analysis, quince genome is no exception with 85.9% of the total TE content identified as LTRs (Table 3). Class II elements constituted 8.1% of the total number of identified TEs. Helitrons and hAT superfamily transposons were the two most abundant DNA transposons, representing 26.3 and 29% of the Class II transposable elements. Out of the 104,710 TEs identified in the assembly, 113 loci (0.1%) were classified as 'ambiguous' since their phylogenetic assignment could not be inferred (Table 3). Detailed information on the identified TEs including the subfamily assignments is provided as Supplementary Table S6. Quince genome assembly was also characterized for microsatellite composition. Search parameters were applied to identify mononucleotide repeats and tandem repeats of 2-to-6 nucleotides. As a result, a total of 308,171 microsatellite loci were identified in the assembly, corresponding to an overall density of 1.6 kb/microsatellite interval. Microsatellite motifs and positions in the assembly are provided as Supplementary Table S7. 171,414 of the identified loci were mononucleotide repeats and 136,757 loci were repeats of 2-to-6 nucleotide motifs (Table 4). Similar to the results of the microsatellite search in the quince genome, mononucleotide repeats appear as the most abundant repeat type in several plant genomes in case they are included in repeat mining analyses 45,46 . Stretches of A/T repeats predominated in the mononucleotide repeats, constituting 96.4% of the total number of identified mononucleotide microsatellites (Fig. 4b). Dinucleotide repeats (117,091 loci) were the most abundant microsatellite type in the pool of 2-to-6 nucleotide microsatellites ( Table 4). The most frequent dinucleotide motif was AT (27,583 loci), representing 23.6% of the total number of dinucleotide microsatellites (Fig. 4b). Overall, AT-rich repeats predominated in C. oblonga genomic microsatellites (Fig. 4b). These results

Conclusion
Quince is a neglected member of the Rosaceae family in terms of genomic studies. The species is closely related to the major pome fruits, apple (M. domestica) and pear (P. communis). Quince fruits are processed to different food products, liqueurs and aromatic distillates and, quince tree is a common rootstock for pear production. Yet, genomic research in quince has lagged far behind other pome fruit species. As a result of the present study, the first draft genome of quince was assembled from whole genome shotgun paired-end reads. Gene content was characterized by ab initio and homology-based gene predictions. Machine-learned classification methods were tested and applied for pre-miRNA coding loci predictions. Our results identified SVM as an appropriate machinelearning algorithm for de novo genomic pre-miRNA coding loci prediction. Transposable element content was characterized with a domain-based search and phylogenetic classification approach, identifying a very similar transposon composition with the closely related and well-characterized apple genome. Microsatellite content of the genome assembly was also analyzed and reported. Data produced in the present work provide insights into the quince genome for the first time and constitute a basis for further functional genomic research in quince.

Data availability
This Whole Genome Shotgun project has been deposited at DDBJ/ENA/GenBank under the accession JADOBS000000000, BioProject PRJNA675337. Sequence Read Archive (SRA) files have been deposited under the same BioProject (PRJNA675337). Gene features file describing the de novo predicted genes in gff format can be accessed at https ://doi.org/10.6084/m9.figsh are.13538 942. Assembly characterization data generated during this study are included in this published article as Supplementary Information files. www.nature.com/scientificreports/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creat iveco mmons .org/licen ses/by/4.0/.