Draft genome of Tanacetum cinerariifolium, the natural source of mosquito coil

Pyrethrum (Tanacetum cinerariifolium), which is a perennial Asteraceae plant with white daisy-like flowers, is the original source of mosquito coils and is known for the biosynthesis of the pyrethrin class of natural insecticides. However, the molecular basis of the production of pyrethrins by T. cinerariifolium has yet to be fully elucidated. Here, we present the 7.1-Gb draft genome of T. cinerariifolium, consisting of 2,016,451 scaffolds and 60,080 genes predicted with high confidence. Notably, analyses of transposable elements (TEs) indicated that TEs occupy 33.84% of the genome sequence. Furthermore, TEs of the sire and oryco clades were found to be enriched in the T. cinerariifolium-specific evolutionary lineage, occupying a total of 13% of the genome sequence, a proportion approximately 8-fold higher than that in other plants. InterProScan analysis demonstrated that biodefense-related toxic proteins (e.g., ribosome inactivating proteins), signal transduction-related proteins (e.g., histidine kinases), and metabolic enzymes (e.g., lipoxygenases, acyl-CoA dehydrogenases/oxygenases, and P450s) are also highly enriched in the T. cinerariifolium genome. Molecular phylogenetic analysis detected a variety of enzymes with genus-specific multiplication, including both common enzymes and others that appear to be specific to pyrethrin biosynthesis. Together, these data identify possible novel components of the pyrethrin biosynthesis pathway and provide new insights into the unique genomic features of T. cinerariifolium.

Insect-borne infection is one of the most familiar and prevalent diseases worldwide. To date, a wide variety of insecticides and insect repellents have been developed to prevent insect-borne infection. Mosquito coils, which were first commercialized by Dainihon Jochugiku Co., Ltd., in 1902, are the origin of potent insecticides and insect repellents that are harmless to humans. The natural source of the original mosquito coil was the flower head of pyrethrum (Tanacetum cinerariifolium), a perennial Asteraceae plant with a white daisy-like flower (Fig. 1A). The major insecticidal compounds are T. cinerariifolium-specific specialized metabolites designated pyrethrins. These phytochemicals are biosynthesized by esterification of acid moieties (chrysanthemic acid and pyrethric acid) and alcohol moieties (pyrethrolone, jasmolone, and cinerolone) 1 , yielding a total of six compounds, pyrethrin I and II, jasmolin I and II, and cinerin I and II, by combination of the acid and alcohol moieties 1 in T. cinerariifolium. Pyrethrins exhibit highly selective neurotoxicity to insects and are degraded rapidly in the presence of sunlight and oxygen 2,3 . Characterization of pyrethrins led to the development of pyrethroids, structurally related compounds that are easily synthesized 4 . Both pyrethrins and pyrethroids have been extensively employed as common insecticides and insect repellents that have provided protection against insect-borne diseases such as malaria and dengue fever in many countries 5,6 .
To date, some pyrethrin biosynthesis-related enzymes have been identified 1,[7][8][9][10][11] (Fig. 1B). Regarding the generation of alcohol moieties of pyrethrins, linolenic acid is converted to 13-hydroperoxylinolenic acid by T. cinerariifolium 13-lipoxygenase (TcLOX1) and subsequently converted to jasmone by allene oxide cyclase (AOC), allene oxide synthase (AOS), 3-oxo-2-(2-pentenyl)-cyclopentane-1-octanoic acid reductase 3 (OPR3), and additional unknown enzymes. Jasmone is converted to jasmolone by T. cinerariifolium jasmone hydroxylase (TcJMH). Although pyrethrolone and cinerolone are thought to be derived from jasmolone, the mechanisms and enzymes responsible for this conversion have yet to be elucidated. Regarding the synthesis of the acid moieties of pyrethrins, two molecules of dimethylallyl diphosphate are condensed to chrysanthemyl diphosphate by T. cinerariifolium chrysanthemyl diphosphate synthase (TcCDS); the product subsequently is oxidized to . Proposed biosynthetic pathway of pyrethrins. Chrysanthemic acid and pyrethric acid as acid moieties are synthesized from dimethylallyl diphosphates by T. cinerariifolium chrysanthemyl diphosphate synthase (TcCDS), T. cinerariifolium alcohol dehydrogenase 2 (TcADH2), T. cinerariifolium aldehyde dehydrogenase 1 (TcALDH1), T. cinerariifolium chrysanthemol 10-hydroxylase (TcCHH), and T. cinerariifolium 10-carboxychrysanthemic acid 10-methyltransferase (TcCCMT). Pyrethrolone, jasmolone, and cinerolone as alcohol moieties are synthesized from linolenic acid by enzymes related to oxylipin pathway, including T. cinerariifolium jasmone hydroxylase (TcJMH), and unknown enzyme(s). Finally, chrysanthemoyl-CoA (the CoA-activated version of chrysanthemic acid) and pyrethrolone are esterified by T. cinerariifolium GDSL (Gly-Asp-Ser-Leu motif) lipase (TcGLIP) to produce pyrethrin I. Compounds and enzymes are indicated by red and blue text, respectively. Unidentified enzymes are indicated as question marks. chrysanthemic acid by T. cinerariifolium alcohol dehydrogenase 2 (TcADH2) and T. cinerariifolium aldehyde dehydrogenase 1 (TcALDH1). The other acid moiety, pyrethric acid, is generated by T. cinerariifolium chrysanthemol 10-hydroxylase (TcCHH) and T. cinerariifolium 10-carboxychrysanthemic acid 10-methyltransferase (TcCCMT). Finally, CoA-activated chrysanthemic acid (chrysanthemoyl-CoA) and pyrethrolone are esterified to pyrethrin I by T. cinerariifolium GDSL (Gly-Asp-Ser-Leu motif) lipase (TcGLIP). While TcGLIP has been isolated and identified as the enzyme catalyzing the esterification of pyrethrin I, the enzymes that catalyze the synthesis of pyrethrin II, jasmolin I, jasmolin II, cinerin I, and cinerin II have yet to be identified. In this context, elucidation of T. cinerariifolium genome sequence is expected to provide crucial clues for clarification of the whole biosynthetic pathway of the pyrethrins and of the evolutionary processes. The genome sequence also is expected to facilitate the development of molecular tools for the generation of transgenic T. cinerariifolium and other systems for efficient and sustainable production of natural pyrethrins that can be difficult to chemically synthesize.
Genomes of various members of the Asteraceae family, including Chrysanthemum seticuspe 12 , Artemisia annua 13 and Helianthus annuus 14 , have been reported over the past few years, but no pyrethrin-producing member of the Asteraceae family have been sequenced 15 . The genome of T. cinerariifolium was estimated to be approximately 7.1 Gb by a flow cytometric method 16 , indicating a genome that is more than twice the sizes of other Asteraceae genomes (e.g., 2.72 Gb for C. seticuspe, 1.74 Gb for A. annua and 3.6 Gb for H. annuus). Assembling such large genomes still remains challenging due to the presence of highly repetitive sequences, multiplied paralogous genes, and heterozygosity 17 . Here, we report a draft genome of T. cinerariifolium derived by massively parallel sequencing using Hiseq4000 and X. The repetitive sequence content and multiplied pyrethrin-biosynthetic genes included in the assembled genome sequence also were analyzed.

Result
Sequence assembly and annotation of T. cinerariifolium genome. Paired-end (PE) libraries and mate-pair (MP) libraries (with 3-, 5-and 8-kb insert sizes; MP-3 kb, MP-5 kb, and MP-8 kb, respectively) of the T. cinerariifolium genome were generated and then sequenced using Hiseq X and Hiseq4000 instruments. The total bases of the sequence reads for PE, MP-3 kb, MP-5 kb, and MP-8 kb amounted to 1,497 Gb, 135 Gb, 166 Gb, and 95 Gb, respectively (Table 1). Subsequently, the output reads were subjected to contig assembly and scaffolding steps to estimate the genome sequence (Fig. 2). The 3,892,368 contigs with total length of 7.7 Gb (Table 2) were constructed by assembling PE reads using SOAPdenovo 18 and Platanus 19 and merging the contigs using SSPACE-Longread 20 . The assembled contigs then were scaffolded with PE and MP reads using SSPACE 21 , which concatenates contig sequences with pair information of the reads, and Gapfiller 22 , which fills the inter-contig unknown bases ('N's) in scaffolds with ' A/T/G/C' (Fig. 2). The total length of the resultant scaffolds was 7.1 Gb, which corresponded with the flow cytometry-estimated genome size for T. cinerariifolium 16 . The N50 value of the scaffolds was 14 Kb, and the maximum length of the scaffolds was 1.2 Mb (Table 2). Subsequently, the draft genome was subjected to Augustus 23 with 'intron hints' , which were generated with previous T. cinerariifolium transcriptomes 24,25 , resulting in the prediction of 935,992 putative genes.
To evaluate the completeness of the genome, the draft genome sequences were subjected to analysis with BUSCO 26 , which counts complete (C), fragmented (F), and missing (M) conserved genes in genome sequences. The analysis with 1440 conserved core plant genes showed that 91.8% of conserved genes (82.3% as complete and 9.5% as fragmented) were present in the T. cinerariifolium genome assembly (Table 3). Since these scores indicated quality as high as those of other plant genomes (with reported scores of 83.8% in Cuscuta campestris 27 , 77.5% in Ruellia speciosa 28 , 86.9% in Erigeron breviscapus 29 , and 89.0% in Secale cereal 30 ), we used the T. cinerariifolium draft genome for subsequent analysis.
To annotate the predicted genes, transposable elements (TEs) were first annotated using hmmpfam 31 against the Gypsy database (GyDB) 32   Flowchart of genome assembly and gene prediction. The PE reads were subjected to contig assembly by a four-step process, including "pre-assembly" using PANDAseq, "contig assembly" using SOAPdenovo and Platanus, "clean-up" using BLASTN, and "merging with other assembly result" using SSPACE-longread. Then, the MP reads were subjected to scaffolding by a three-step process, including "read selection" using bowtie2, "scaffolding" using SSPACE-STANDARD, and "gapfilling" using GapFiller. These processes yielded the complete draft genome. The coding sequences in the draft genome then were annotated by a three-step process, including "gene prediction" using Augustus, "transposable elements (TEs) detection" using hmmpfam, and "protein superfamilies prediction" using InterProScan.  Table 3. Annotation statistics for draft genome. * C: percentage of full-length conserved genes in BUSCO notation; F: percentage of fragmented conserved genes in BUSCO notation; M: percentage of missing genes in BUSCO notation; TE: transposable element.   we evaluated the number of co-clustered genes in single-genus clusters, which should reflect the number of genus-specific duplication events (Fig. 3). If these TEs had multiplied in a common ancestor, most TEs would be expected to be positioned in orthologous clusters with bootstrap values greater than 50% (see Fig. 3, example of non-multiplied clusters). Otherwise, most TEs would be expected to be positioned in multiplied clusters (Fig. 3).

Contigs Scaffolds
In fact, 98%, 92%, 93%, and 98% of sire TEs were separately multiplied in T. cinerariifolium (Fig. 3A), C. seticuspe (Fig. 3B), A. annua (Fig. 3C), and H. annuus (Fig. 3D), respectively, indicating that most of the sire TEs had multiplied in the individual genera. Likewise, 85%, 69%, 78%, and 89% of oryco TEs were multiplied in the respective organisms ( Fig. 3E-H), indicating that most of the oryco TEs also had multiplied in the individual genera. Moreover, 90% of sire TEs (Fig. 3A) and 49% of oryco TEs (Fig. 3E) clustered with eight or more multiplied genes. Collectively, these results suggest that sire and oryco TEs were multiplied in the individual lineages of the respective Asteraceae genera; in the case of T. cinerariifolium, TEs of these clades occupy approximately 13% of the genome. We also analyzed the distribution of TEs in genome scaffolds by counting the number of sire and oryco genes encoded in each 20-Kb region. As shown in Fig. 4, uniform distributions of sire and oryco genes were observed using frequency plots for sire (Fig. 4A) and oryco ( Fig. 4B) genes in each 20-Kb interval. Furthermore, no local distribution was detected by frequency plots for any of the TE clades (Fig. 4C). We next analyzed the number of TEs present within 20 Kb of another member of the respective TE clade. As shown in Fig. 4D, 82.35% and 92.65% of sire and oryco genes (respectively) were separated by more than 20 Kb from other TEs of the respective clade. Together, these data indicated that the TEs that multiplied in T. cinerariifolium, are scattered throughout the entire genome.

functional annotation of the T. cinerariifolium genes and inter-genus comparative analysis.
Next, we investigated multiplication ratios of protein superfamilies in T. cinerariifolium compared with those in other species. This analysis was performed using InterProScan 33 . Specifically, the protein data sets of C. seticuspe, www.nature.com/scientificreports www.nature.com/scientificreports/ A. annua, H. annuus, N. tabacum, O. sativa, and A. thaliana were subjected to analysis using InterProScan and the multiplication odds scores were calculated for each combination of superfamily and species (Table 6). A positive value of the multiplication odds score indicates that a genus possesses a higher number of multiplied genes in a given superfamily than do other genera. As shown in Table 6, the highest multiplication odds scores for T. cinerariifolium were observed for the biodefense-, signaling-, and metabolism-related superfamilies.
For biodefense-related superfamilies, "Ribosome-inactivating protein (RIP)" (IPR036041) and "Ricin B-like lectins" (IPR035992), both of which are categorized as type II RIPs, showed multiplication odds scores of 2.05 and 1.57 in T. cinerariifolium, respectively (Table 6). RIPs, including ricin, show high toxicity to a wide range of species, serving as biodefense molecules for the producing plant. For example, tobacco RIP, isolated and purified from N. tabacum leaves, has been reported to possess strong antibacterial activity against Pseudomonas solanacearum, Erwinia amylovora, and Shigella asonei 34 . Pokeweed antiviral protein, purified from the extracts of plant leaves of Phytolacca americana 35 , has been reported to enhance this plant's systemic resistance to tobacco mosaic virus infection 36 . Sambucus nigra agglutinin I (SNA-I) has been shown to exhibit insecticidal potency against Hemiptera 37 . Indeed, a BLASTP search of SNA-I sequence against T. cinerariifolium RIPs detected Tci_399175 with an E-value score of 2.99 × 10 −112 and 37.89% identity (Supplemental Fig. 3A). Sequence alignment and BLASTP results revealed that Tci_399175 harbors the well conserved RIP domain and two RICIN domains containing Q-X-W motifs (Supplemental Fig. 3A). In other work, RIP domains have been implicated in the cleavage of rRNA glycosidic bonds, while RICIN domains have been shown to be involved in internalization via binding of target cell glycans 38 . This sequence analysis suggested that the T. cinerariifolium-enriched RIPs include insecticides similar to SNA-I. These findings indicated that T. cinerariifolium possesses not only low-molecular-weight insecticides such as sesquiterpene lactones 39 and pyrethrins 2,3 but also a greater number of genes (compared with other plants) encoding native proteins with toxicity against insects. www.nature.com/scientificreports www.nature.com/scientificreports/ For signaling-related superfamilies "Signal transduction histidine kinase, dimerization/phosphoacceptor domain" (IPR036097), "Rho GDP-dissociation inhibitor domain" (IPR024792), and "HECT, E3 ligase catalytic domain" (IPR035983) showed specific multiplication in T. cinerariifolium, with multiplication scores of 1.39, 1.32, and 1.14, respectively (Table 6). Some plant genera are known to possess histidine kinases (proteins that act as ethylene receptors) as well as non-ethylene-receptor histidine kinases (e.g., cytokinin receptors) 40 . A BLASTP search using A. thaliana ethylene-response 1 (ETR1) 41 , an ethylene-activated kinase, as a query detected Tci_144982, a predicted protein that includes a G-X-G motif-containing HATPase_c domain, with an E-value score of 0 and 69.18% identity (Supplemental Fig. 3B). This sequence analysis suggested that T. cinerariifolium-enriched histidine kinases include ethylene-receptor-type histidine kinases similar to ETR1. These T. cinerariifolium-specific multiplications of signaling molecules suggested that T. cinerariifolium has genus-specific enrichment of signal transduction and regulation proteins. While the molecular mechanisms of regulation of the accumulation and composition of pyrethrins are yet to be defined, the products of these enriched signaling-related genes are obvious candidates for regulators of pyrethrin biosynthesis.
The metabolism-related superfamily showed the strongest gene multiplication, with genes encoding "Enolase-like, C-terminal domain" (IPR036849), "Cytochrome c oxidase-like domain" (IPR036909) "Lipoxygenase, C-terminal domain" (IPR036226), "Acyl-CoA dehydrogenase/oxidase N-terminal domain" (IPR037069), and "Metal-dependent hydrolase, composite domain" (IPR011059) exhibiting multiplication scores of 1.28, 1.22, 1.16, 1.13, and 1.11, respectively (Table 6). Since lipoxygenases (e.g., TcLOX1) and acyl-CoA dehydrogenases/oxidases (enzymes catalyzing reactions related to the beta-oxidation of 12-oxo-cis-10,15-phytodienic acid) are known to be involved in the oxylipin pathway leading to the synthesis of pyrethrin alcohol moieties (Fig. 1B), these multiplication scores are compatible with the specific biosynthesis of pyrethrins and jasmonates in T. cinerariifolium. Furthermore, "Cytochrome P450" (IPR036396) was detected with a multiplication score of 0.37, representing a total of more than 745 genes in T. cinerariifolium and making the corresponding loci the most abundant class of genes in this plant. Indeed, TcJMH, a protein that is categorized as a P450, is known to be involved in the biosynthesis of jasmolone (Fig. 1B); thus, the multiplication of P450s is consistent with a role of this protein class in the T. cinerariifolium-specific biosynthesis of jasmonates. The ferritin-like superfamily, members of which are responsible for iron transport, was found to have a multiplication score of 1.42. In A. thaliana, ferritins 1, 3, and 4 have been shown to regulate the expression of multiple genes encoding P450 enzymes 42 . A BLASTP search with A. thaliana ferritin 1 (AtFer-1) against T. cinerariifolium ferritin-like proteins detected Tci_154278 with an E-value of 5.64 × 10 −115 and 69% identity, including a conserved eukaryotic ferritin domain-containing iron ion channel, a ferroxidase diiron center, and a ferrihydrite nucleation center (Supplemental Fig. 3C). In this regard, the multiplied genes encoding ferritin-like proteins in T. cinerariifolium are expected to include proteins that modulate the expression of various P450s, presumably including one or more that affect TcJMH expression. This result is consistent with the duplication of genes encoding metalloenzymes such as metal-dependent hydrolases, cytochrome c, aconitases, lipoxygenases, and P450s (Table 6), and suggests that T. cinerariifolium likely needs to import larger amounts (compared with other plants) of metals for use in various enzyme reactions and their regulation.
The lowest multiplication scores were found with "Expansin, cellulose-binding-like domain" (IPR036749), "Mitochondrial glycoprotein" (IPR036561), and "ArfGAP domain" (IPR038508), which exhibited multiplication scores of −0.53, −0.53, and −0.49, respectively ( Table 7). The multiplication scores of these superfamilies in O. sativa, C. seticuspe, A. annua, and H. annuus were as low as the multiplication scores in T. cinerariifolium (Table 6), indicating that the depletion of the genome for members of this superfamily was not unique to T. cinerariifolium among the Asteraceae. In contrast, "Endochitinase-like", a superfamily that plays a pivotal role in defense against fungal pathogens 43 , exhibited a T. cinerariifolium-specific low-copy status. The accumulation of RIPs ( Table 6) and depletion of endochitinases suggested that the corresponding biodefense strategy has evolved specifically in T. cinerariifolium.  www.nature.com/scientificreports www.nature.com/scientificreports/ Phylogenetic analysis of pyrethrin-related enzymes. One of the most characteristic features of T.
cinerariifolium is the presence of the pyrethrin biosynthetic pathway, components of which include TcLOX1 (T. cinerariifolium 13-lipoxygenase), TcJMH (T. cinerariifolium jasmone hydroxylase), TcCDS (T. cinerariifolium chrysanthemyl diphosphate synthase), and TcGLIP (T. cinerariifolium GDSL (Gly-Asp-Ser-Leu motif) lipase). TcLOX1 and TcJMH are related to the lipoxygenase and P450 superfamilies, classes that were shown (by InterProScan analysis) to be enriched in T. cinerariifolium. On the other hand, since TcCDS (unlike typical terpene synthases) has non-head-to-tail cyclase activity with isoprenoid units and TcGLIP (unlike typical GDSL lipases) has acyltransferase activity, there are no corresponding InterProScan superfamilies into which these enzymes can be categorized in terms of function. Therefore, we further analyzed the expansion of genes encoding TcLOX1, TcJMH, TcCDS, and TcGLIP using a BLAST search against the predicted proteomes of 13 different plant species available on public databases, and performed a molecular phylogenetic analysis using a maximum-likelihood tree with 500 bootstraps.
TcLOX1 is a member of a class of widely prevalent plant enzymes; TcLOX1 has been shown to be responsible for production of the alcohol moieties of pyrethrins via conversion of linolenic acid to 13-hydroperoxylinolenic acid 7 (Fig. 1B). A BLASTP search using the TcLOX1 sequence as a query detected 134 proteins, including 9 T. cinerariifolium proteins, 29 other Asteraceae proteins, and 96 non-Asteraceae proteins, with 42.03-57.58%, 42.31-95.26%, and 35.09-75.38% identity to TcLOX1, respectively. Molecular phylogenetic analysis of these proteins revealed that TcLOX1 formed an orthologous clade with C. seticuspe, A. annua, and H. annuus proteins ( Fig. 5A and Supplemental Fig. 4, clade I). This clade was positioned proximally to another clade consisting only of Asteraceae proteins ( Fig. 5A and Supplemental Fig. 4, clade II), indicating that these LOX families were duplicated in Asteraceae. In addition, these clades ( Fig. 5A and Supplemental Fig. 4, Clades I and II) belonged to a larger lipoxygenase clade ( Fig. 5A and Supplemental Fig. 4, clade III) that included A. thaliana LOX3 and 4, which have the same activity as TcLOX1. This LOX clade was consistent with the report that Asteraceae plants share the linolenic acid lipoxygenase activity 7 .
TcJMH catalyzes the conversion of jasmone to jasmolone 8 (Fig. 1B), which has been reported only in the pyrethrin-synthesizing genus. A BLASTP search using the TcJMH sequence as a query detected 136 proteins, including 9 T. cinerariifolium proteins, 30 other Asteraceae proteins, and 97 non-Asteraceae proteins with 34.22-81.53%, 37.67-85.01%, and 36.44-58.01% identity to TcJMH, respectively. Molecular phylogenetic analysis of these proteins revealed that TcJMH formed a subclade with one other T. cinerariifolium protein, two A. annua proteins, and one C. seticuspe protein ( Fig. 5B  . This inclusion within a large clade is in contrast with the fact that the conversion of jasmone to jasmolone has been reported only in the Tanacetum genus 8 . These results suggested that such non-Tanacetum TcJMH-like protein genes are non-functional or are involved in a pathway for biosynthesis of phytochemicals other than those observed in T. cinerariifolium. TcCDS participates in the formation of terpenoid acid moieties by conversion of dimethylallyl diphosphate to chrysanthemyl diphosphate 9 (Fig. 1B), which also has been reported to be a genus-specific pyrethrin-synthesizing reaction. A BLASTP search using the TcCDS sequence as a query detected 122 proteins, including 9 T. cinerariifolium proteins, 28 other Asteraceae proteins, and 85 non-Asteraceae proteins with 61.27-88.54%, 36.36-93.52%, and 20.21-66.57% identity to TcCDS, respectively. Molecular phylogenetic analysis of these proteins revealed that TcCDS formed a subclade with three A. annua proteins and three other T. cinerariifolium proteins (Fig. 5C and Supplemental Fig. 6, Clade I), within a large cluster consisting of seven T. cinerariifolium proteins and 17 Asteraceae proteins (Fig. 5C and Supplemental Fig. 6, Clade II). The phylogenetic tree for TcCDS-like proteins did not include explicit orthologous clades; this result was consistent with a study demonstrating that the production of genus-specific monoterpenes, including chrysanthemyl diphosphate, has diverged in individual Asteraceae genera 44 . Previously, A. annua was shown to have a large number of terpene-related loci in its genome; the products of these genes are responsible for the biosynthesis of various genus-specific terpenes 13 . These findings support the view that these terpene-biosynthesis genes also have multiplied and diverged in T. cinerariifolium, as reported for those of A. annua 13 .
TcGLIP synthesizes pyrethrin I by esterification of chrysanthemic acid moieties and pyrethrolone moieties in reactions specific to the pyrethrin-synthesizing genus 1 (Fig. 1B). A BLASTP search using the TcGLIP sequence as a query detected 135 sequences, including 9 T. cinerariifolium proteins, 30 other Asteraceae proteins, and 96 non-Asteraceae proteins with 48.73-81.78%, 31.87-79.35%, and 30.38-53.35% identity to TcGLIP, respectively. Molecular phylogenetic tree analysis of these proteins revealed that TcGLIP formed a subclade with one A. annua

Distribution analysis of pyrethrin-related enzymes. Recent studies have shown that genes encod-
ing components of a particular metabolic pathway, especially those responsible for specialized plant metabolites, are genomically co-localized and co-regulated as Metabolic Gene Clusters (MGCs) 45,46 . To find genes co-localized with pyrethrin-related MGCs, we initially subjected the genome sequences to PhytoClust 46 with default settings for cluster range and flanking region. However, no MGCs that included loci encoding pyrethrin biosynthesis-related enzymes were detected in the T. cinerariifolium genome. To permit further analysis, the distribution of genes within the scaffolds that included loci encoding TcLOX1, TcJMH, TcCDS and TcGLIP were analyzed using the GenomeJack software program. The TcLOX1-encoding gene was located on scaffold sc00017724 and showed co-localization with one gene encoding a hypothetical protein (Tci_094327) and four TE genes (Fig. 6A). A BLASTP search using the Tci_094327 sequence as a query detected the Artemisia annua squamosa promoter-binding (SPB) transcription factor (PWA57483.1) with an E-value of 2 × 10 −35 and 52.35% identity (Supplemental Fig. 8A). Since the A. annua SPB transcription factor regulates the expression of proteins responsible for artemisinin B biosynthesis 47 , Tci_094327 is a candidate regulator of the oxylipin pathway. The TcJMH-encoding gene was positioned on scaffold sc00012411 and showed co-localization with two genes encoding putative transporters (Tci_072426 and Tci_072429) and three TE genes (Fig. 6B). The TcCDS-encoding www.nature.com/scientificreports www.nature.com/scientificreports/ gene was positioned on scaffold sc00057709 and exhibited co-localization with the gene encoding Tci_214196 (a protein that lacks apparent homologs by BLASTP) and one TE gene (Fig. 6C). These results indicated that genes encoding other pyrethrin-associated enzymes were not distributed proximal to the loci encoding either TcJMH or TcCDS. In contrast, the gene encoding TcGLIP was positioned on scaffold sc00006304 and co-localized with loci encoding two GDSL family lipases (Tci_043407 and Tci_043410), one glutathione S-transferase (Tci_043405), one hypothetical protein (Tci_043408), and two TE genes (Fig. 6D). Tci_043407 and Tci_043410 showed 50.53% and 41.10% identity to TcGLIP (Supplemental Fig. 8B), respectively, suggesting that these genes had been generated by gene duplication and might encode enzymes with activities similar to that of TcGLIP. Thus, our results collectively indicated that a gene encoding the SPB transcription factor was proximal to the locus encoding TcLOX, and that genes encoding two GDSL lipases lie proximal to the locus encoding TcGLIP, suggesting that these novel genes may be co-regulated with those encoding pyrethrin-related enzymes known to be involved in pyrethrin biosynthesis.

Discussion
T. cinerariifolium is characterized by species-specific biosynthesis of the pyrethrins, a class of potent natural insecticides that are non-harmful to humans. A wide variety of specialized metabolites, including the pyrethrins, has been found to play defensive roles against several threats such as pests, pathogens, and herbivores 2,3,13,48 . Furthermore, the markedly larger genome size (7.1 Gb) of T. cinerariifolium, compared with those of other plants of the Asteraceae genera (2.72 Gb for C. seticuspe 12 , 1.74 Gb for A. annua 13 , and 3.6 Gb for H. annuus 14 ), suggests the occurrence of T. cinerariifolium-specific evolutionary processes. To address these issues, the present study explored the draft genome sequence of T. cinerariifolium. In general, the assembly of large-sized genomes, such as that of T. cinerariifolium, is frequently hindered or precluded by numerous sequence errors due to inappropriate sequencing methods and the presence of repetitive sequences 17 . Therefore, we sequenced the T. cinerariifolium genome to high depth, including 211-fold coverage by PE reads (to provide error correction) and 56-fold coverage by MP reads (to permit the assembly of repetitive sequences) (Table 1). Furthermore, PANDAseq (Fig. 2) also has been shown to be effective for large-genome assembly, since this program includes Illumina-optimized error correction and longer-read generation using overlapping paired-end information 49 . The assembly strategy with high-depth reads used in the present paper eventually elucidated a 7.1-Gb genome sequence with a N50 of 14 Kb www.nature.com/scientificreports www.nature.com/scientificreports/ and 91.8% completeness as judged by BUSCO analysis (Tables 2 and 3). The resultant genome sequence revealed that TEs and multiple superfamily genes are enriched in the T. cinerariifolium genome.
The sire-and oryco-clade TEs were shown to have accumulated after the divergence of T. cinerariifolium from other Asteraceae (Fig. 3, and Supplemental Figs. 1 and 2), with TEs of these clades occupying 11.16 and 2.14% of the genome sequence, respectively (Table 4). This accumulation of the sire-and oryco-clade TEs contributes to the large size of the T. cinerariifolium genome. Furthermore, TEs have been suggested to be involved in the multiplication of plant disease resistance-related genes in Capsicum 50 . In this context, the accumulation of the sire-and oryco-clade TEs in T. cinerariifolium also may contribute to the T. cinerariifolium-specific enrichment of signaling-, biodefense-, and biosynthesis-related enzymes, as detected by InterProScan analyses.
The T. cinerariifolium-enriched signaling-related superfamily genes included those encoding histidine kinases (Table 6), a group shown (using a BLASTP search and sequence alignment; Supplemental Fig. 3B) to include ethylene-receptor type histidine kinases similar to A. thaliana ETR1. These results suggested that T. cinerariifolium-enriched histidine kinases include proteins involved in responses to air-borne compounds. Notably, T. cinerariifolium upregulates pyrethrin biosynthesis by emitting a specific combination of volatile organic compounds (VOCs) when the plant body has been wounded 51,52 , suggesting that the upregulation of components of the pyrethrin biosynthetic pathway by wound-induced VOCs may occur via histidine kinase-mediated signal induction. Moreover, genes encoding lipoxygenases, which are known to contribute to the biosynthesis of VOCs 53 , also were shown to be enriched in a genus-specific fashion in T. cinerariifolium (Table 6). These findings provide crucial clues to the investigation of the VOC-dependent regulation of pyrethrin biosynthesis.
The T. cinerariifolium-enriched biodefense-related superfamily genes included those encoding ribosome inactivating proteins (RIPs) ( Table 6), which are toxic to herbivores 37 . Furthermore, a BLASTP search and sequence alignment also suggested that T. cinerariifolium-enriched RIPs include insecticidal proteins similar to SNA-I (Supplemental Fig. 3A). These results indicated that T. cinerariifolium has the potential to produce such toxic proteins as biodefense factors, in addition to the production of insecticidal phytochemical toxins such as the pyrethrins. In contrast, the T. cinerariifolium genome contains only a single gene encoding a putative endochitinase (Table 7), a class of proteins known to be involved in defense against fungal pathogens 43 . This observation accords with the low resistance of T. cinerariifolium against serious foliar diseases caused by fungi such as Stagonosporopsis tanaceti and Didymella tanaceti 54 , leading to the need for care in preventing fungal pathogen infection during the cultivation of T. cinerariifolium. Altogether, the present results support the view that T. cinerariifolium might have been adapted to the dry environment of its area of origin, the Balkans, where fungi are less numerous than in the areas of the plant's current cultivation. Consequently, this plant is likely to have been endowed with multiple RIPs as potent defenses against herbivores in addition to the acquisition of pyrethrins during the evolutionary processes.
T. cinerariifolium also was found to be enriched for genes encoding members of the lipoxygenase and P450 superfamilies (Table 6), which are involved in the pyrethrin biosynthetic pathway. The existence of unique genes related to pyrethrin biosynthesis is one of the most prominent features of the T. cinerariifolium genome. Molecular phylogenetic analyses of the enzymes for pyrethrin biosynthesis demonstrated that paralogs of TcGLIP, which synthesizes pyrethrin I (Fig. 1B), clustered in a T. cinerariifolium-specific clade ( Fig. 5D and Supplemental Fig. 7, clade II), while paralogs of TcJMH and TcCDS, which synthesize jasmolone and chrysanthemyl diphosphate, respectively (Fig. 1B), clustered in Asteraceae-specific clades (Fig. 5B,C and Supplemental Figs. 5 and 6, clade II). In contrast, TcLOX1-related enzymes, which are widely prevalent plant enzymes responsible for the synthesis of 13-hydroperoxylinolenic acid (Fig. 1B), clustered in a clade including non-Asteraceae proteins ( Fig. 5A and Supplemental Fig. 4, clade III). In particular, genes encoding TcCDS-related enzymes, which are categorized as terpenoid synthases, were as multiplied and varied in T. cinerariifolium as in A. annua, where this class of proteins is known to biosynthesize various genus-specific terpenes 13 (Fig. 5C and Supplemental Fig. 6). Consequently, these multiplied genes encoding terpenoid-biosynthetic enzymes of T. cinerariifolium are consistent with T. cinerariifolium-specific terpenoid synthesis. In contrast, TcJMH, the key enzyme for synthesis of the alcohol moieties of the pyrethrins (Fig. 1B), was shown to be highly conserved in other plants ( Fig. 5B and Supplemental Fig. 5), whereas the alcohol moieties of pyrethrins, including jasmolone, pyrethrolone, and cinerolone, have been identified only in members of the Tanacetum genus 1-3 . These results raise the question whether these alcohol moieties have not yet been identified in non-Tanacetum plants, or whether TcJMH-like proteins can act on other substrates.
InterProScan analysis revealed that the T. cinerariifolium genome is enriched for genes encoding members of the ferritin-like superfamily, proteins that are responsible for iron transport. Furthermore, various iron enzymes, including members of the lipoxygenase and P450 superfamily, a class that includes enzymes of the pyrethrin-biosynthesis pathway (including TcLOX1, TcJMH, and a putative P450 for conversion of jasmolone to pyrethrolone 55 ) also were shown to be multiplied (Table 6). Indeed, AtFer-1 is reported to regulate the expression of P450 in A. thaliana 42 , and T. cinerariifolium is enriched for genes encoding ferritins, included a locus encoding a ferritin similar to AtFer-1 (Supplemental Fig. 3C, Tci_154278). Taken together, these results support the view that the ferritin-like superfamily and pyrethrin-biosynthetic lipoxygenase and P450s might have co-evolved in this lineage, given these proteins' complementary roles in the uptake and biochemical use of iron. Thus, further work will need to examine the functional correlation of these multiplied ferritin-like superfamily proteins with multiplied genus-specific pyrethrin-biosynthesis enzymes.
Molecular phylogenetic analysis of TcGLIP, which plays a critical role in the esterification reaction that serves as the final step in the biosynthesis of pyrethrin I (Fig. 1B), revealed that genes encoding TcGLIP-related proteins have diverged in the genome of T. cinerariifolium (Fig. 5D and Supplemental Fig. 7). These results suggest that TcGLIP (and potentially other related proteins) acquired a role in esterification of pyrethrins in the specific lineage of this genus. Previously, the catalyzing activity of recombinant TcGLIP for the esterification of pyrethric acid with pyrethrolone was found to be lower than that of pyrethric acid with jasmolone, pyrethric acid with (2019) 9:18249 | https://doi.org/10.1038/s41598-019-54815-6 www.nature.com/scientificreports www.nature.com/scientificreports/ cinerolone, chrysanthemic acid with pyrethrolone, chrysanthemic acid with jasmolone, and chrysanthemic acid with cinerolone 1 . Combined with those previous findings, the results of the present study suggested that the newly detected TcGLIP-family proteins likely catalyze uncharacterized esterification reactions (Fig. 1B). Collectively, the present genomic and molecular phylogenetic analyses are expected to contribute to the verification of pathways leading to the biosynthesis of unidentified pyrethrins (e.g., the conversion of jasmolone to pyrethrolone) and the underlying regulatory mechanisms. Furthermore, the molecular data obtained in the present study are expected to facilitate the creation of transgenic and genome-edited T. cinerariifolium, paving the way to metabolic engineering for efficient phytochemical production, including that of additional pyrethrins.
The information on the co-localization (within the genome) of loci encoding related metabolic activities also is important, given that genes encoding components of a particular metabolic pathway, especially for specialized plant metabolites, frequently form such Metabolic Gene Clusters (MGCs) 45,46 . Although no MGCs for the enzymes known to be involved in pyrethrin biosynthesis were detected, a gene encoding a squamosa promoter-binding (SPB) transcription factor-like protein was observed to be co-localized with the gene encoding TcLOX1 (Fig. 6A). The A. annua SPB transcription factor regulates genes encoding proteins involved in the biosynthesis of artemisinin B, an A. annua-specific terpenoid 47 . Since TcLOX1 is not involved in synthesis of the terpenoid subunit of pyrethrin but rather in the synthesis of the alcohol moieties of pyrethrins (Fig. 1), it will be interesting to see if this co-localized SPB transcription factor-encoding gene is involved in pyrethrin biosynthesis. Similarly, we infer that newly detected TcGLIP-family proteins likely catalyze uncharacterized esterification reactions (Fig. 1B). Genes encoding two GDSL lipases also were found to be co-localized with the locus encoding TcGLIP (Fig. 6D). These newly detected lipases are candidates for uncharacterized esterification reactions involved in the synthesis of cinerin I and II, jasmolin I and II, and/or pyrethrin II (Fig. 1B).
In conclusion, we have elucidated the draft genome of T. cinerariifolum, the natural source of the original mosquito coils. InterProScan and molecular phylogenetic analyses revealed the T. cinerariifolum-specific multiplication of genes encoding toxic proteins, metalloproteins, ferritins, histidine kinases, and pyrethrin biosynthetic enzymes, which are consistent with the unique physiological, ecological, and phytochemical features of these compounds. The draft genome opens new avenues for exploration of unidentified enzymes in pyrethrin biosynthesis and elucidation of the evolutionary processes of T. cinerariifolum, and provides fundamental information for the development of metabolic engineering of further beneficial phytochemicals using transgenic or genome-edited T. cinerariifolum. De novo assembly of genome sequences. The obtained read sequences were cleaned for assembly as described in our previous study 56 . Adapters derived from the Truseq or Nextera mate-pair Sample Prep Kit, low-quality reads, and short reads (<36 bp in length) were trimmed by using Trimmomatic version 0.36 57 .

Materials and
Contig sequences were generated from the PE reads by a four-step process ("pre-assembly", "SOAPdenovo assembly", "clean-up", and "merging with other assembly result"). First, the cleaned PE reads, including overlaps, were detected and pre-assembled by PANDAseq 49 . Second, these pre-assembled reads and remaining reads were subjected to SOAPdenovo v2.04-r240 18 with multi k-mers from 80 to 127 to generate contig sequences. Third, the SOAPdenovo-generated contigs were cleaned by trimming the contigs showing more than 95% sequence identity with other contig sequences. Fourth, PE reads also were subjected to processing using the Platanus assembler 1.2.4 19 with default parameters, and the generated contigs were merged with SOAPdenovo-constructed contigs using SSPACE-longread v1-1 20 , including setting minimum overlap length to 36 bp, minimum number of links to 1, and maximum link ratio to 0.5. Scaffold sequences were generated by a three-step process ("read selection", "scaffolding", and "gapfilling"). First, PE and MP reads were mapped to contig sequences using bowtie version 2.3.4.3 58 with the-local option; read pairs mapped discordantly were selected for the following steps. Second, the selected reads and contigs were subjected to scaffolding using SSPACE-STANDARD version 3.0 21 (BaseClear), including setting the minimum number of links to 3. Third, the positions for which the bases were unknown (e.g., 'N') in the scaffolds were filled with ' A/T/G/C' by GapFiller v1-10 22 (BaseClear) to complete the draft genome, setting. including setting the minimum number of overlaps to 30 and the number of reads to trim off to 50. Furthermore, the completeness of the draft genome was evaluated using BUSCO-v3 26 with the embryophyta_odb9 protein set.
Gene prediction and annotation. The T. cinerariifolium RNA-Seq data available on NCBI Sequence Read Archive (SRR2062279 and SRR2064180 24 , as well as SRR5985187-5985194 25 ) were mapped to the assembled genome to generate the 'intron hints' for gene prediction. Subsequently, the assembled genome and 'intron hints' were subjected to Augustus 3.3.1 23 trained with available T. cinerariifolium genes in GenBank. TEs in predicted genes were identified by hmmpfam in HMMER 2.3.1 31 against GyDB 2.0 32 with an E-value cutoff of 1.0. To identify high-confidence genes, the non-TE genes with known protein signatures were detected using InterProScan 5.6-48.0 33 HA412HO_v1.1 14 ), A. annua (ASM311234v1 13 ), and C. seticuspe (CSE_r1.0 12 ). As described for TE detection in T. cinerariifolium, the coding regions of these genomes were estimated using Augustus 3.3.1 23 with the Arabidopsis model set and the default parameters, and the TEs in these predicted transcriptomes were extracted using hmmpfam 31 against GyDB 32 . According to GyDB 32 classifications, the percentage of genomic regions occupied by each clade of TE was calculated as an accumulation score.
Molecular phylogenetic trees for sire-and oryco-clade TEs also were estimated using the ORTHOSCOPE method 63 as described in our previous study 64 . The amino acid sequences of hmmpfam-extracted reverse-transcriptase domains encoded by the TEs in T. cinerariifolium, H. annuus, A. annua, and C. seticuspe were aligned with CLUSTAL W-mpi 0.13 65 , and maximum-likelihood phylogenetic trees based on a JTT matrix-based model 66 with 100 bootstraps were created using Fast Tree 2.1.10 (JTT model, CAT approximation) 67 . comparative analysis of protein superfamily content versus that in other plants. We also detected protein signatures using InterProScan 33 analysis in six genome-elucidated plants, using methods as described above for the InterProScan analysis of T. cinerariifolium. Having determined the number of genes possessing each superfamily signature, the multiplication odds score for each combination of InterProScan-detected superfamily signature (Sig) and plant genus (Genus) were calculated as follows: = + + N Genus Sig PS N Sig PS Multiplication odds score (Genus, Sig) log ( , ) ( ) 2 where N (Genus, Sig) represents the number of the plant-genus genes with an InterProScan-detected superfamily signature and PS represents the pseudo-count constant, which was set to 5.00.
In further analysis for functional proteins, we subjected the detected proteins of the superfamily with the highest multiplication odds score in each category to a BLASTP 68 search (version 2.7.1). S. nigra agglutinin I (SNA-I, accession No. O22415.1), A. thaliana ethylene-response 1 (ETR1, accession No. AAA70047.1) and A. thaliana ferritin-1 (AtFer-1, accession No. CAA63932.1) were used as queries for ribosome-inactivating proteins, signal transduction histidine kinase proteins and ferritin-like proteins, respectively. The detected protein pairs (SNA-I and Tci_399175, ETR1 and Tci_144982, and AtFer-1 and Tci_154278) were subjected to sequence alignment using CLUSTAL W-MPI 0.13 65 .
Molecular phylogenetic trees were further estimated for T. cinerariifolium pyrethrin biosynthetic enzymes as described in our previous study 63,64 . Briefly, TcLOX1-, TcJMH-, TcCDS-, and TcGLIP-related proteins were identified using BLASTP 2.7.1 68 75 . The BLAST hit sequences were screened using an E-value cut-off of 10 −3 , and the top ten hits were used for the subsequent phylogenetic analyses. These sequences were aligned using CLUSTAL W-MPI 0.13 65 , and maximum-likelihood phylogenetic trees based on a JTT matrix-based model 66 with 500 bootstraps were created using MEGA software 76 .
Metabolic Gene Clusters (MGCs) were investigated with PhytoClust 46 ; this analysis required definition of variables corresponding to the span of chromosomal region in which genes are clustered (cluster range) and the length of region to be searched for additional marker enzymes (flanking region). We set both variables to 20 Kbp, in accordance with the values used in the previous report 46 . The genes included in the same scaffolds as those encoding pyrethrin-related enzymes (TcLOX1, TcJMH, TcCDS and TcGLIP) were visualized with the GenomeJack software program (Mitsubishi Space Software, Tokyo, Japan).

Data availability
The draft genome sequences and annotations, along with the raw reads for PE and MP, have been deposited in the DNA Data Bank of Japan (DDBJ) under the BioProject accession Code PRJDB8358.