The chromosome-level quality genome provides insights into the evolution of the biosynthesis genes for aroma compounds of Osmanthus fragrans

Sweet osmanthus (Osmanthus fragrans) is a very popular ornamental tree species throughout Southeast Asia and USA particularly for its extremely fragrant aroma. We constructed a chromosome-level reference genome of O. fragrans to assist in studies of the evolution, genetic diversity, and molecular mechanism of aroma development. A total of over 118 Gb of polished reads was produced from HiSeq (45.1 Gb) and PacBio Sequel (73.35 Gb), giving 100× depth coverage for long reads. The combination of Illumina-short reads, PacBio-long reads, and Hi-C data produced the final chromosome quality genome of O. fragrans with a genome size of 727 Mb and a heterozygosity of 1.45 %. The genome was annotated using de novo and homology comparison and further refined with transcriptome data. The genome of O. fragrans was predicted to have 45,542 genes, of which 95.68 % were functionally annotated. Genome annotation found 49.35 % as the repetitive sequences, with long terminal repeats (LTR) being the richest (28.94 %). Genome evolution analysis indicated the evidence of whole-genome duplication 15 million years ago, which contributed to the current content of 45,242 genes. Metabolic analysis revealed that linalool, a monoterpene is the main aroma compound. Based on the genome and transcriptome, we further demonstrated the direct connection between terpene synthases (TPSs) and the rich aromatic molecules in O. fragrans. We identified three new flower-specific TPS genes, of which the expression coincided with the production of linalool. Our results suggest that the high number of TPS genes and the flower tissue- and stage-specific TPS genes expressions might drive the strong unique aroma production of O. fragrans.


Introduction
Sweet osmanthus (Dicotyledons, Lamiales, Oleaceae, Osmanthus) is one of the most popular, evergreen ornamental tree species in China due to its unique sweet aroma 1,2 . More than 160 cultivars of O. fragrans have been classified based on phenotypes such as the leaf shape, flower color, aroma, season and frequency of flower blooming 3 . The association between phenotypes and genotypes of O. fragrans has been examined through aroma compounds [4][5][6] , essential oils 7-10 , and taxonomy using various molecular markers [11][12][13][14][15][16] . Transcriptome studies have determined the genes that might be responsible for the emission of flower scent in O. fragrans 17,18 . Gene expression has also been modulated at different flowering stages of O. fragrans 19 . Differential gene expression studies have identified genes in the mediated isopentenol production (MEP) pathway, as well as the terpenoid-and carotenoid-synthesis pathways. Transcriptomics studies allowed researchers to make connections between the major flower aroma compounds, and differentially expressed genes and encoded proteins. The flower aroma compounds, (R)-and (S)-linanool are produced by a terpene synthase(s) (TPS) 20 . Another key flower aroma compound, β-ionone, is produced through the oxidative cleavage of β-carotene by carotenoid cleavage enzymes (CCD) [21][22][23] . Transcriptome studies have shown that TPS(s) and CCD(s) are differentially expressed at different flowering stages in O. fragrans 24,25 . Additionally, we and others have recently reported a set of transcription factors (TFs) associated with the expression of color and the emission of fragrance in O. fragrans 21,26,27 . All of these gene expression studies provide valuable insights on how flower blooming and aroma production are interlinked 28 . However, a genome sequence is largely needed to reveal the full genetic background of aroma production in sweet osmanthus and the evolution of aroma in the family Oleaceae.
In this study, we generated a reference genome for O. fragrans to provide a solid foundation for our future understanding of the genome structure and evolution of the Oleaceae family. Furthermore, we conducted a detailed analysis of the aroma compounds, tissue and flowering time-specific differential gene expression to investigate the molecular mechanisms of sweet fragrance development in O. fragrans.

Sequencing summary
We generated 100-fold PacBio single-molecule long reads (a total of 73.4 Gb with an N50 length of 13.0 kb), 77-fold k-mer depth Illumina paired-end short reads (45.1 Gb) and Hi-C data that produced 23 unambiguous chromosome scaffolds for a high-quality assembly. For stepwise assembly, we first performed an initial PacBioonly assembly, resulting in an assembly size of 733.5 Mb and a contig N50 of 1.59 Mb, and the assembled genome had a highly complete BUSCOs (96.1 %) (Supplemental Table 1). Then, the initial contigs were subsequently polished with PacBio long reads and Illumina short reads. As the final step, Hi-C data were used to polish the scaffolds generated by the PacBio and Illumina reads.

Determination of genome size and heterozygosity
The k-mer method 29 and KmerFreqAR 30 were used to determine the genome size of O. fragrans using the quality-filtered reads of Illumina data. The genome size was estimated based on the formula: Genome size = Modified k-mer number/average k-mer depth, where modified k-mer = Total k-mer number−error k-mer number and the average k-mer depth obtained from the main peak of k-mer distribution curve (Supplemental Fig. 1). To determine the heterozygosity, Arabidopsis genome data were used to simulate Illumina PE reads, which was carried out by using pIRS software 31 . Then, a fitting KmerFreqAR 30 was developed using the k-mer distribution curve of O. fragrans. When the two k-mer curves were consistent, the heterozygosity of Arabidopsis was considered the reference for the heterozygosity of O. fragrans. The final analysis produced~1.45% heterozygosity of the O. fragrans genome.
Genome assembly and quality assessment The integrated work-flow of genome assembly is shown (Supplemental Fig. 2). The full PacBio long reads were converted to fasta format. Then, all subreads of genome data were assembled using Falcon v0.3.0 32 with specific parameters (length-cutoff pr = 8 kb; length-cutoff pr = 9 kb). We used Arrow (https://github.com) to polish the draft genome (G1) to obtain the corrected genome (G2). Then, G2 was polished again by Pilon 33 , which mapped the next-generation sequencing data to G2 with bwa to obtain the twice-corrected genome (G3). The O. fragrans genome had high heterozygosity, which led to a G3 size larger than the estimation. To acquire the nonredundant genome, heterozygous and redundant sequences were removed from the corrected genome using Redundans 34 with the following parameters: heterozygosity = 0.0145 and Sequencing Depth = 86. The nonredundant genome (G4) was~741 Mb, with a contig N50 size of 1.595 Mb (Table 1). Finally, BUSCO v3.0 analysis 35 was performed to assess G4 using the embryophyta_odb10 database with default parameters.
The clustering of contig by hierarchical clustering of the Hi-C data was performed. Through a comparative analysis, the only pair of reads around the DpnII digestion site was determined. Hi-C linkage was used as a criterion to measure the degree of tightness of the association between different contigs by standardizing the digestion sites of DpnII on the genome sketch. Agglomerative hierarchical clustering and LACHESIS produced chromosome assembly maps with a karyotype of 2n = 46 ( Fig. 1). As a result, the total number of contigs of the O. fragrans genome map was 5327, and the total length was  Table 2).
We used homology-based and de novo approaches to identify transposable elements. RepeatMasker 37 was used to search against the Repbase (v. 22.11) 38 and Mips-REdat libraries 39 . Then, we used RepeatMasker v4.0.6 to search   Table 4).

Annotation of noncoding RNA (ncRNA)
We identified rRNA, miRNA, and snRNA genes in the O. fragrans genome by searching the Rfam database (release 13.0) 40 Table 5).

Gene prediction
The protein-coding genes were identified using homology-based and de novo predictions-based approaches. The O. fragrans genome was mapped against the published sequences of Arabidopsis thaliana, Olea europaea, Sesamum indicum, Solanum tuberosum, and Vitis vinifera. To accurately identify spliced alignments, we used GeneWise v2.2.0 44 to filter all initially aligned coding sequences. For de novo prediction, the data from NGS and the full-length transcriptomes were analyzed with hisat2-2.1.0 and PASApipeline-2.0.2 to predict the complete gene set. We randomly selected 1000 genes to train the model parameters for Augustus v3.3 36 , GeneID v1.4.4 45 , GlimmerHMM 46 , and SNAP 47 . The final consensus gene set was generatedusing EVidenceModeler (EVM) v1.1.1 48 , which combined the genes predicted by the de novo and homology searches 49,50 The assembled genome had 45,542 genes with an average transcript length of 4065 bp, an average CDS length of 1142 bp, and a number of exons per gene of 5 (Supplemental Table 6).
The functional validity of the predicted genes was further evaluated by searching the UniProt (release 2017_10), KEGG (release 84.0), and InterPro (5.21-60.0) databases using Blastall44, KAAS,49, and InterProScan50. As a result, we were able to assign potential functions to 43,573 protein-coding genes out of the total of 45,542 genes in the O. fragrans genome (95.68 %) (Supplemental Table 7). axillaris, Petunia inflata, Prunus mume, Rosa chinensis, Solanum lycopersicum, and V. vinifera). We applied the OrthoMCL (v2.0.9) pipeline 51 (BLASTP E-value ≤ 1e−5) to identify the potential orthologous gene families between the genomes of these plants. Gene family clustering identified 17,513 gene families consisting of 38,808 genes in O. fragrans, of which, 1086 gene families were unique to O. fragrans. O. europaea, and F. excelsior had the biggest number of shared gene families among these plants (Fig. 2).

Synteny analysis
We used the protein sequences of O. fragrans that were aligned against each other with Blastp (E-value ≤ 1e−5) to achieve the conserved paralogs, Then, MCScanX (http:// chibba.pgml.uga.edu/mcscab2) was used to find the collinearity block in the genome. Using the Circos tool (http://www.circos.ca), we mapped and gene density, GC content, Gypsy density, and Copia density, as well as the average expression value of genes expressed in flowers on individual chromosomes (Fig. 3).

Whole-genome duplication (WGD)
To determine the source of the high number of genes (>45,000) in O. fragrans, the WGD events were analyzed by taking advantage of the high-quality genome of O. fragrans. We applied four-fold synonymous third-codon transversion (4DTv) and synonymous substitution rate (Ks) estimation to detect the WGD events. First, respective paralogous of O. fragrans, G. max, O. europaea, V. vinifera, and A. thaliana were identified with OrthoMCL. Then, the protein sequences of these plants were aligned against each other with Blastp (E-value ≤ 1e−5) to achieve the conserved paralogs of each plant. Finally, the WGD events of each plant were evaluated based on their 4DTv (Fig. 4a) or Ks (data not shown) distribution. The WGD analysis suggestted that O. fragrans, G. max and O. europaea experienced WGD events within less than 15 MYA, but V. vinifer and A. thaliana have not experienced WGD events recently (Fig. 4a). We also compared the number of duplicated genes (Fig. 4b), the chromosomelevel duplications (Fig. 4c), and the number of a functional homologs of glycotransferase and bHLH-Myc transcription factor genes between O. fragrans and V. vinifera (Fig. 4d), further validating the WGD events.

Expression analysis
We also produced comprehensive transcriptome dataset using both HiSeq and the Iso-Seq pipeline. We focused our further analysis on identifying the specific genes responsible for floral development and the biosynthesis of volatile aroma compounds in O. fragrans. The members of MADs transcription factors that control plant development were highly expressed in all tissues tested. Among them, AG, AP3/PI, AP1, and SEP were predominantly expressed in the early flower stage (S1), whereas, the expression level of the ANR1 gene family was highly specific to the root tissue (Fig. 5b). Interestingly, the numbers of ABCE genes were higher than that of Fraxinus chinensis, a close relative of O. fragrans (Fig. 5a).
The major component of the volatile compounds in the floral scent of O. fragrans, linalool (Fig. 6), is known to be synthesized by terpenoid synthetases (TPS). Therefore, we compared the expression profiles of TPS genes and identified over 40 genes that contain the functional motifs of TPS. Differential gene expression (DGE) analysis identified 7 TPS genes that are highly expressed in flowers, compared to roots, leaves, and stems (Fig. 7).

Discussion
Sweet osmanthus is one of the most beloved ornamental tree species in China and other parts of the world and has been cultivated for over two-thousand years in China due to its attractive traits of beautiful colors, unique aromas, a long flowering season, and medicinal efficacy. However, there is a limited number of studies that have investigated the genetic basis of the phenotypic diversity of sweet osmanthus. Recently, a set of genetic markers was identified 14,15 , and an effort to construct a genetic linkage map was reported 16 . Additionally, several transcriptomics studies identified a large number of genes that are differentially expressed in some of the cultivars with attractive traits 17,18 . While these studies indirectly associate the diverse phenotypes with the genotypes of sweet osmanthus, there is no genome information that can directly link the specific genes to particular traits. Thus, we have sequenced, assembled and annotated the genome of sweet osmanthus. Furthermore, combining HiSeq-and IsoSeqbased transcriptome analyses, we gained deep insight into The high-quality reference genome provides deep insights to the evolution of O. fragrans Currently, there are still no comprehensive analyses combining genomic, transcriptomic, and metabolic approaches to reveal the unique aroma of O. fragrans. Despite advances in second-generation sequencing, it is still very challenging to construct a high-quality plant genome due to the high complexity, large size, and high percentage of repeats and polyploidy. Therefore, we combined the second-generation short read to achieve high accuracy, the third-generation long reads for de novo assembly, and Hi-C to scaffold contigs into a chromosome-scale assembly. To guarantee a high-quality genome annotation, we combined de novo, homologybased, and experimental evidence obtained from the extensive transcriptomics data, including the full-length transcripts. We constructed a reference-quality genome that produced an unambiguous chromosome-scale assembly (N = 23) and functionally annotated 43,573 genes out of the complete set of 45,542 genes of O. fragrans (95.68 %).
The number of genes, 45,542, is high and is more than the genes present in some of the plants that are related to O. fragrans (Fig. 2). This can be attributed to the repeated gene duplications which led to expansion of the gene families. The O. fragrans genome has higher number of multicopy genes compared to other plant species (Fig. 2). Furthermore, O. fragrans appears to have obtained and retained a large number of genes through the whole genome duplications (Fig. 4). The majority of plant species have experienced genome duplications in their evolutionary past 52,53 . The high gene number of O. fragrans might be a result of complex interactions among various fragrans var ruixiangui might support an extensive breeding among cultivars throughout its history, although it is challenging to accurately determine the origin of the observed heterozygosity in the cultivar. Furthermore, as an androdioecious species 55 , the coexistence of selfing and crossing poses an additional challenge to trace the origin of the high heterozygosity. Recently, the first genetic map of O. fragrans was created using the SLAF-seq method 16 to provide a framework for understanding the genome organization. This linkage map has helped us assemble the reference genome and can help to investigate the origin of the high heterozygosity and history of hybridization among the cultivars of O. fragrans.
The new genome can also be used as a reference for the whole genome resequencing of sweet osmanthus cultivars. Resequencing these whole genomes of various cultivars provides highly useful information on the potential drivers for the phenotype diversity, evolution, and   population structure of a given species 56 . Our preliminary genome sequencing of 30 different cultivars of O. fragrans identified a large number of single nucleotide polymorphisms (SNP), copy number variation (CNV), insertion sdeletion (InDel), structural variations (SV) and other mutation sites (unpublished results). Using the above mutation loci as new molecular genetic markers, researchers can study the history of cultivation, population dynamics and genetic diversity.
The whole-genome duplication and the tandem duplication of the biosynthetic genes is likely the cause for the strong sweet aroma of O. fragrans Among TPS-family genes, the TPS-b and g subfamilies are known to synthesize monoterpenes 57 . Linalool, the major aroma compound identified in our study, is produced by the monoterpene synthesis pathway in O. fragrans. Using the high-quality genome and deep transcriptome information, we found a significant expansion of TPS as a whole, and of subfamilies b and g specifically, compared with the grape (V. vinifera), which did not have whole genome duplication (Fig. 5). In addition to TPS1, 2, 3, and 4, which have been previously functionally validated, we identified seven additional TPS genes that are specifically expressed in flower stages S1 and S3. Three TPS genes appear to be new genes that are flower specific, indicating that the production of fragrance is controlled by a complex network involving multiple TPS genes functioning in time-and flower-specific manners. Our results suggest that the unique aromas of O. fragrans are some of the outcomes of the interrelationship between genome evolution, transcriptional regulation, and metabolic control. Our current work lays a solid foundation for further studies on the comparative genomics, molecular and biochemical mechanisms of aroma development in O. fragrans.

Conclusion
We constructed a high-quality reference genome of O. fragrans by combining Illumina, PacBio and Hi-C platforms. The genome of O. fragrans var. rixianggui is 740 Mb and has a high heterozygosity of 1.45 %. A large number of genes (45,542) was predicted by the gene models built with de novo, homology-based, and experimental data obtained from extensive transcription results. Our deep genome analysis indicates evidence of whole-genome duplication at~15 MYA. Our new genome information should help the research community study the genome structure, genetic basis of genetic diversity, and regulation of the flowering process and

Genome sequencing
For genome sequencing, leaf samples were collected from a male tree (O. fragrans var. rixianggui) on the campus of Nanjing Forestry University, Nanjing, China, and were processed for genomic DNA isolation and library construction. Rixianggui (Semperfloren) is a unique cultivar because it has a strong aroma and blooms continuously, except in hot summer months, while other cultivars, for example, Thunbergii, Latifolius, and Aurantiaeus, bloom only in autumn. Genomic DNA was extracted using the CTAB method, size fractionated with BluePipin (Sage Science, Inc, MA, USA), used for library construction following the PacBio SMRT library construction protocol, and sequenced on the PacBio Sequel platform (Pacific Biosciences, CA, USA). For Illumina library construction, the extracted DNA was fragmented and size-fractionated using g-tube and BluePipin, then subjected to paired-end library construction and sequenced on the HiSeq X ten platform (Illumina Inc, CA, USA).

Hi-C sequencing
To ensure the quality of the Hi-C library, leaf samples were initially examined forintegrity of the nuclei by DAPI staining. Once confirmed for high quality nuclei, the samples were processed following the Hi-C procedure [58][59][60] . The Hi-C library was sequenced on the Illumina HiSeq X ten platform (Illumina, CA, USA), generating 740 million Hi-C read pairs, which were submitted to the Lachesis Hi-C scaffolding pipeline 58 . Hi-C libraries produce different molecular types, including invalid pairs of self-circles, dangling-ends, and dumped-pairs. According to the different molecular types that lead to the alignment of paired reads on the genome in different directions, the unique alignment of reads on the genome needs to be statistically analyzed. Once recognized as an effective interaction, the final data only retained effective interactions. According to the above rules, the position of the DpnII digestion site in the reference genome was used, because it can also provide useful information on the structural organization of individual chromosomes.

Transcriptome sequencing
To obtain information that can assist in the empirical annotation of genes, full-length transcriptome sequencing Fig. 6 The top 7 secondary metabolites produced by the osmanthus flower measured by GC-MS. Note that the top molecule is linalool, a monoterpene, which was produced most in the S1 stage of the flower, and then decreased its amount in the flower. S1, S2, and S3 stand for the early, middle, and late stages, respectively was performed. The samples from flowers at three different blooming stages (S1: beginning, S2: middle, S3: late; Suppl. Figure 3), leaves, stems, and roots were collected from the same tree described above and processed for library construction. The total RNAs were extracted according to the manufacturer's instructions of TRNzol Universal Reagent (Cat# DP424, TIANGEN Biotech Co. Ltd, Beijing, China). The quality and quantity of the RNA samples were evaluated using a NanoDrop™ One UV-Vis spectrophotometer (Thermo Fisher Scientific, USA), a Qubit® 3.0 Fluorometer (Thermo Fisher Scientific, USA) and an Agilent Bioanalyzer 2100 (Agilent Technologies, USA). All RNA samples with integrity values close to 10 were used for cDNA library construction and sequencing. The cDNA library was prepared using the TruSeq Sample Preparation (Illumina Inc, CA, USA) and IsoSeq Library Construction kits (Pacific Biosciences, CA, USA), and paired-end sequencing with 150 bp was conducted on a HiSeq X ten platform (Illumina Inc, CA, USA).

Aroma compound analysis
Fresh flowers at three different stages (S1: beginning, S2: middle, S3: lat), defined by the size of the flower (Supplemental Fig. 3), were picked from the same tree at the time of sample collection for the transcriptome studies described above. Sampling was replicated five times, and the samples were quickly put into polyethylene bags impermeable to gases, kept frozen and stored at −20°C. Headspace solid phase microextraction (SPME) combined with gas chromatography-mass spectrometry (GC-MS) was used to determine the identity and quantity of the aroma volatiles. Flowers (0.3 g) were placed in a 4 mL solid-phase microextraction vial (Supelco Inc, USA), 1 μl of 1000× diluted ethyl caprate (Macklin Inc, China) was added, and vials were capped with a 65 µm DB-5 ms extraction head (Supelco Inc, USA). Then, the vial was incubated for 40 min in a water bath at 45°C to volatilize the aroma compounds and release them into the headspace. After the adsorption period, the fiber head was removed and introduced into the heated injector port of the GC for desorption at 250°C for 3 min. The desorbed volatile compounds from DB-5 ms were analyzed on a Trace DSQ GC-MS (Thermo-Fisher Scientific, USA), equipped with a 30 m x 0.25 mm × 0.25 mm TR-5 ms capillary column (Supelco Inc, USA). The oven temperature was programmed at 60°C for 2 min, increasing at 5°C/min to 150°C, then increasing at 10°C/min to reach 250°C, followed by maintaining the temperature of the transfer line at 250°C. Helium was taken as the carrier gas at a linear velocity of 1.0 mL/min. Mass detector conditions on MS were: source temperature: 250°C and the electronic impact (EI) mode at 70 eV, with a speed of 4 scans/s over the mass range m/z 33-450 amu in a 1 s cycle. Volatile compounds were first auto-matched by mass spectra using the NIST98 database through Chem-Station (Agilent, USA). A series of n-alkanes (C7-C30) (Sigma St. Louis, MO) was injected into the GC-MS set to obtain the linear retention indices of the volatile compounds, and they were analyzed under the same conditions. The data were also compared with published linear retention indices (NIST Chemistry WebBook, SRD 69). The normalization of peak-areas was used to calculate the quantities of the volatile aroma compounds.