Zanthoxylum-specific whole genome duplication and recent activity of transposable elements in the highly repetitive paleotetraploid Z. bungeanum genome

Feng, Shijing; Liu, Zhenshan; Cheng, Jian; Li, Zihe; Tian, Lu; Liu, Min; Yang, Tuxi; Liu, Yulin; Liu, Yonghong; Dai, He; Yang, Zujun; Zhang, Qing; Wang, Gang; Zhang, Jisen; Jiang, Huifeng; Wei, Anzhi

doi:10.1038/s41438-021-00665-1

Article
Open access
Published: 03 September 2021

Shijing Feng^1,2,
Zhenshan Liu³,
Jian Cheng⁴,
Zihe Li⁵,
Lu Tian^1,2,
Min Liu⁶,
Tuxi Yang^1,2,
Yulin Liu^1,2,
Yonghong Liu^1,2,
He Dai⁶,
Zujun Yang⁷,
Qing Zhang⁸,
Gang Wang⁸,
Jisen Zhang⁸,
Huifeng Jiang⁴ &
Anzhi Wei^1,2

Horticulture Research volume 8, Article number: 205 (2021)

Abstract

Zanthoxylum bungeanum is an important spice and medicinal plant that is unique for its accumulation of abundant secondary metabolites, which create a characteristic aroma and tingling sensation in the mouth. Owing to the high proportion of repetitive sequences, high heterozygosity, and increased chromosome number of Z. bungeanum, the assembly of its chromosomal pseudomolecules is extremely challenging. Here, we present a genome sequence for Z. bungeanum, with a dramatically expanded size of 4.23 Gb, assembled into 68 chromosomes. This genome is approximately tenfold larger than that of its close relative Citrus sinensis. After the divergence of Zanthoxylum and Citrus, the lineage-specific whole-genome duplication event η-WGD approximately 26.8 million years ago (MYA) and the recent transposable element (TE) burst ~6.41 MYA account for the substantial genome expansion in Z. bungeanum. The independent Zanthoxylum-specific WGD event was followed by numerous fusion/fission events that shaped the genomic architecture. Integrative genomic and transcriptomic analyses suggested that prominent species-specific gene family expansions and changes in gene expression have shaped the biosynthesis of sanshools, terpenoids, and anthocyanins, which contribute to the special flavor and appearance of Z. bungeanum. In summary, the reference genome provides a valuable model for studying the impact of WGDs with recent TE activity on gene gain and loss and genome reconstruction and provides resources to accelerate Zanthoxylum improvement.

Introduction

As close relatives of Citrus in the Rutaceae family, plants of the genus Zanthoxylum generate strong tingling and numbing sensations in the mouth, which together with the pungent taste of hot chili form the spicy-hot flavor of Asian cuisine. This genus contains ~250 species native to tropical and subtropical regions worldwide, including Asia, America, and Africa¹. Plants of this genus are well known for their ability to biosynthesize abundant important secondary metabolites, including flavonoids², terpenoids³, and olefinic alkamides^4,5,6. In particular, the tingling sensation caused by Zanthoxylum bungeanum is due to the accumulation of sanshools, a group of alkaloids that are unique to the genus Zanthoxylum^7,8. Research findings have also indicated that secondary metabolites from Zanthoxylum species exhibited anticancer⁹, anesthetic¹⁰, analgesic¹¹, antiwrinkle¹², anti-inflammatory and other biological activities, suggesting great potential of these chemicals in the development of new drugs. Therefore, this genus has been widely used in the food industry^3,13, cosmetics industry¹², and traditional medicines^14,15,16. The identification and utilization of critical medicinal and agrochemical compounds from Zanthoxylum plants have significant economic value and have thus attracted increasing research interest from plant biologists.

Zanthoxylum bungeanum (common name: HuaJiao), one of the earliest domesticated crops in this genus, has been cultivated for the last 2,000 years in southwest China², which is thought to be the center of origin of Zanthoxylum. This region harbors 36 of the 41 Chinese Zanthoxylum species¹⁷. Ancient Chinese people regarded the fruits of Z. bungeanum as a symbol of fertility, wealth, and longevity. Evidence for the medicinal use of Z. bungeanum can be traced back to the earliest traditional Chinese medicine monograph ShenNongBenCaoJing (The Divine Farmer’s Classic of Materia Medica), written during the Han Dynasty (206 BC–220 AD). Since then, Z. bungeanum has been included in prescriptions for the treatment of numerous diseases^18,19,20. However, this plant was not used as a major spice until the Three Kingdoms period (3rd century AD), which is still much earlier than the introduction of hot chili to China. Currently, Z. bungeanum is still one of the major native spices widely consumed in China, with a cultivation area of 1.7 million hectares that accounts for an economic value of 4.0 billion USD (Supplemental Text). To date, multiple landraces and elite cultivars of Z. bungeanum have been developed through long-term conventional selective breeding efforts².

Despite its importance as a native spice crop, Z. bungeanum-related genetic research is almost nonexistent. The availability of whole genome sequences for Rutaceae has been limited to Citrus^21,22,23. This shortcoming impedes our understanding of the genome evolution and regulation of metabolic pathways for major characteristic constituents. Here, we present a reference genome of Z. bungeanum, employing a combination of three different sequencing technologies. The availability of the Zanthoxylum genome and transcriptome data not only highlights the unique evolutionary trajectory of the Zanthoxylum genome but also aids in deciphering the mechanisms of evolutionary regulation of metabolic pathways for alkamides, flavonoids, and terpenoids. Furthermore, the Zanthoxylum genome provides a good baseline for future comparative genomics in Rutaceae.

Results

Large genome assembly and annotation

Due to its commercial and genetic importance, we selected the widely cultivated Z. bungeanum ‘DaHongPao’ (2n = 136) for genome sequencing (Fig. 1A, B; Figs. S1 and S2). We performed whole-genome sequencing analysis using the PacBio Sequel platform and Illumina HiSeq 2500 platform from seven paired-end libraries, which yielded 430 Gb of PacBio single-molecule real-time (SMRT) long reads (Table S1) and 214 Gb of Illumina reads (Table S2) for genome assembly. We preliminarily obtained a raw assembled genome of 5.25 Gb. After polishing with NextPolish²⁴ and purging the haplotigs and error fragments with purge_dups, we obtained the final genome assembly with a length of 4.23 Gb and a contig N50 of 410 kb, representing 95.5% of the genome size estimated by flow cytometry (Table 1, Fig. S3). However, this assembly was slightly larger than the genome size estimated from 21-mer frequency analysis (4.11 Gb), which may be due to the high heterozygosity of Z. bungeanum (~2.87%, estimated by k-mer frequency, Fig. S4), as reported in pistachio²⁵ and Dendrobium officinale²⁶. We further scaffolded the Z. bungeanum genome to the chromosome scale using Hi-C scaffolding technologies. A total of 255.77 million valid Hi-C read pairs were mapped onto the draft assembly contigs using ALLHiC^27,28. Finally, we obtained a genome with a total size of 4.12 Gb (98% of the primary assembly), containing 68 pseudochromosomes with a scaffold N50 of 74.18 Mb and the longest scaffold of 119.5 Mb (Fig. 1C, D, Table 1, Fig. S5, Table S3).
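For readers unfamiliar with the N50 statistic quoted above, the following minimal sketch shows how it is conventionally computed: sort the sequence lengths in descending order and report the length at which the running total first reaches half of the assembly size. The function name `n50` and the toy contig lengths are illustrative assumptions and are not taken from the assembly itself. The last line simply back-calculates the flow cytometry estimate implied by the reported 4.23 Gb and 95.5% figures.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total assembly length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no lengths given")


# Toy example (hypothetical contig lengths, total = 300, half = 150).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
toy_n50 = n50(toy_contigs)

# Flow cytometry estimate implied by the reported figures:
# 4.23 Gb is stated to be 95.5% of that estimate.
implied_flow_estimate_gb = 4.23 / 0.955

print(toy_n50, round(implied_flow_estimate_gb, 2))
```

The same function applies unchanged to contig or scaffold lengths; only the input list differs.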

**Fig. 1: Morphological characteristics and genomic landscape of *Z. bungeanum*.**

Table 1 Summary of the assembly and annotation of the Z. bungeanum genome.

Putative protein-coding and microRNA genes were annotated based on a comprehensive strategy combining ab initio prediction, homology gene modeling, and transcriptional evidence obtained in this study (Fig. S6, Table S4). A total of 74,307 protein-coding genes were predicted from this assembly (Table 1), 99.09% (73,633) of which were supported by the presence of homology to known proteins, the existence of known functional domains, or the presence of expressed transcripts (Table S5). Additionally, 2,282 noncoding RNA sequences were identified and annotated, including 422 microRNAs (miRNAs), 454 ribosomal RNAs (rRNAs), and 1,406 transfer RNAs (tRNAs) (Table 1). To assess the genome quality and annotation completeness, we checked the core gene statistics using Benchmarking Universal Single-Copy Orthologs (BUSCO) and Conserved Core Eukaryotic Gene Mapping Approach (CEGMA), which suggested that 97.59% (2,270 of 2,326) and 97.82% (448 of 458) of the genes were recovered, respectively (Tables S6 and S7). In addition, our assembled genome obtained a relatively high long terminal repeat (LTR) assembly index (LAI) score (15.36). Taken together, these comparisons indicated that our genome assembly attained reference-level quality.

The genome evolution of Z. bungeanum

The evolution of gene families was analyzed by comparing the Z. bungeanum genome with those of 16 other plant species, including Amborella trichopoda, Piper nigrum, Zea mays, Oryza sativa, Papaver somniferum, Vitis vinifera, Dimocarpus longan, Arabidopsis thaliana, Brassica napus, Gossypium hirsutum, Arachis hypogaea, Cucumis sativus, Sesamum indicum, Capsicum annuum, Citrus sinensis, and Nicotiana tabacum. In total, 577,729 genes were clustered into 52,558 orthologous gene families, of which 5,664 gene families were shared by all 17 species, representing the ancestral gene families, and 532 gene families were specific to Rutaceae plants (Fig. S7; for clarity, only Z. bungeanum, C. annuum, P. nigrum, D. longan, and C. sinensis are shown). We found a total of 1,693 Z. bungeanum-specific gene families consisting of 4,498 genes (Table S8), which were enriched in genes associated with C5-branched dibasic acid metabolism, terpenoid backbone biosynthesis, unsaturated fatty acid biosynthesis, and valine, leucine, and isoleucine biosynthesis, among others (Table S9). We also identified a total of 2,754 gene families that were significantly (P < 0.05) expanded in Z. bungeanum and 47 gene families that were significantly contracted since the split from the common ancestor with C. sinensis. However, C. sinensis showed fewer gene family expansions and more gene family contractions than other species in the order Sapindales (Fig. 2A). Based on the Kyoto Encyclopedia of Genes and Genomes (KEGG) annotations, expanded gene families were highly enriched in various secondary metabolic pathways, including sesquiterpenoid and triterpenoid biosynthesis, flavonoid biosynthesis, phenylpropanoid biosynthesis, linoleic acid metabolism, phenylalanine metabolism, and anthocyanin biosynthesis (Table S10).

**Fig. 2: The evolutionary history of *Z. bungeanum*.**

To investigate the evolution of Zanthoxylum, we derived 659 single-copy genes from the 17 species for phylogenetic analysis (Table S11). The resulting phylogeny indicated that Z. bungeanum was most closely related to C. sinensis, as expected, and that these two species formed the Sapindales clade along with D. longan. Molecular dating, derived using five calibration points, suggested that Z. bungeanum diverged from the most recent common ancestor of C. sinensis approximately 35.3 million years ago (MYA; 95% confidence interval [CI]: 18.47–57.67 MYA) (Fig. 2A). The families Rutaceae and Sapindaceae (D. longan) shared a common ancestor approximately 83.9 MYA (Fig. 2A).

There were significantly more multicopied gene families in Z. bungeanum than in other rosids (Fig. 2A, stack bar and Table S8), which is suggestive of at least one recent whole-genome duplication (WGD) event in the Zanthoxylum lineage. The distributions of synonymous substitutions per synonymous site (K_S) of paralogous genes in the Z. bungeanum genome showed a single peak at approximately 0.21, but no similar peak was identified in C. sinensis (Fig. 2B), suggesting the occurrence of a recent WGD event experienced by Zanthoxylum (η-WGD) that was not shared among other Rutaceae members. These results combined with the phylogenetic analysis (Fig. 2A) indicated that the η-WGD of Z. bungeanum occurred after the divergence of Citrus and Zanthoxylum. To investigate WGD in the Z. bungeanum genome, we performed a comparative genomic analysis of Z. bungeanum with C. sinensis and V. vinifera. We identified a 2:1 syntenic depth ratio in both Z. bungeanum-C. sinensis and Z. bungeanum-V. vinifera comparisons, and these syntenic blocks contained 6,258 and 5,578 pairs of gene models in the Z. bungeanum genome, respectively (Fig. S8). Genomic collinearity of Z. bungeanum with itself identified 2.50 G intragenomic blocks, including 50,631 gene pairs derived from the η-WGD event. Therefore, we concluded that a single Zanthoxylum lineage-specific η-WGD event occurred after the divergence between Zanthoxylum and Citrus. According to the divergence rate between Z. bungeanum and C. sinensis, the η-WGD event occurred approximately 26.8 MYA (Fig. 2A, B), which is much later than the ancient γ-WGD event (~120 MYA) that occurred in the ancestors of core eudicots. 
Additionally, we performed KEGG enrichment analysis on the duplicated genes generated by η-WGD and found that most of them are involved in the proteasome, mRNA surveillance pathway, carbon fixation in photosynthetic organisms, plant hormone signal transduction, and several metabolic pathways, such as fatty acid metabolism, unsaturated fatty acid biosynthesis, pyruvate metabolism, and terpenoid backbone biosynthesis (Table S12).
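The dating of the η-WGD from its K_S peak can be illustrated with the standard relation T = K_S / (2r), where r is the synonymous substitution rate per site per year. The sketch below is a minimal illustration under that assumption; the article calibrates r from the Zanthoxylum–Citrus divergence, but the underlying K_S value is not given in this section, so the rate is back-calculated here from the two reported numbers (K_S peak of 0.21 and an age of 26.8 MYA). The function names are hypothetical.

```python
def ks_to_age_mya(ks, rate_per_site_per_year):
    """Convert a K_S value to an age in million years: T = K_S / (2r)."""
    return ks / (2.0 * rate_per_site_per_year) / 1e6


def implied_rate(ks, age_mya):
    """Back-calculate the substitution rate r implied by a K_S value and an age."""
    return ks / (2.0 * age_mya * 1e6)


# Rate implied by the reported eta-WGD peak (K_S = 0.21) and age (26.8 MYA).
# This is a consistency check, not the calibration used by the authors.
rate = implied_rate(0.21, 26.8)
print(f"implied rate: {rate:.3e} substitutions per synonymous site per year")

# Round trip: the implied rate reproduces the reported age.
print(f"age: {ks_to_age_mya(0.21, rate):.1f} MYA")
```

The factor of 2 reflects that both duplicated copies accumulate substitutions independently after the duplication.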

The high number of chromosomes is an important feature of the Z. bungeanum genome. To assess the chromosome evolution of Zanthoxylum, we placed the 68 extant chromosomes into major groups, corresponding to regions most clearly identifiable as originating from one of the seven chromosomes that existed before the core eudicot triplication (γ-WGT, Fig. 2C). The 19 grape chromosomes were postulated to be the closest modern representative of the ancestral eudicot karyotype²⁹. The genome of A. thaliana supported two recent whole-genome duplication events (α-WGD and β-WGD) and one triplication event (γ-WGT) that gave rise to much of the eudicot clade³⁰. At least 109 fission/fusion events occurred in the five chromosomes of A. thaliana that evolved from the proposed paleohexaploid ancestor. A minimum of 17 chromosomal fissions and 29 chromosomal fusions were necessary for C. sinensis to reach its current structure of nine chromosomes, and 19 fissions and 25 fusions were necessary for Xanthoceras sorbifolia to reach 15 modern chromosomes. However, Z. bungeanum experienced a much more complex evolutionary history with a lineage-specific WGD (η-WGD, Fig. 2B), in addition to the shared ancestral γ-WGT. We speculated that Z. bungeanum might have experienced at least 98 chromosomal fissions and a minimum of 72 chromosomal fusions to reach its present karyotype of 68 chromosomes (Fig. 2C), indicating a high level of genome reconstruction in Z. bungeanum.

Repetitive sequence expansions led to the large genome size in Z. bungeanum

The assembled genome size of Z. bungeanum (4.23 Gb) is approximately tenfold larger than that of its close relative C. sinensis (~0.38 Gb), despite sharing considerably conserved syntenic blocks (Fig. S8, Table S13). In fact, the size of the Z. bungeanum genome is the third largest among sequenced dicots thus far and is only smaller than that of tobacco³¹ and chickpea³² (Fig. S9). We identified and masked 3.78 Gb of the assembly as repetitive elements, which constituted ~89% of the Z. bungeanum genome. Among these elements, LTR retrotransposons were the most abundant transposable elements (TEs), of which Copia elements (43.04%) were a relatively larger component of the repeat landscape than Gypsy elements (29%) (Fig. 3A, Table S14).

**Fig. 3: Comparisons of transposable element (TE) compositions in the *Z. bungeanum* and *C. sinensis* genomes.**

Similar to other plants, the majority (97.4%) of TEs were located in intergenic regions rather than in exons and introns (Fig. S10). To trace the evolutionary dynamics of TEs, we investigated the insertion dates of Copia and Gypsy elements in Z. bungeanum and C. sinensis. A peak of increased insertion activity for both Copia and Gypsy appeared at ~6.41 MYA (Fig. 3B). Specifically, two types of Copia elements were dominant and contributed the most to Z. bungeanum genome expansion (Fig. 3C, Fig. S11). Compared with C. sinensis, a number of diverse and young LTR subfamilies were present in the Z. bungeanum genome (Fig. 3B), along with numerous species-specific LTRs (Fig. S11B, C). Of the identified TEs, only 19.59% were inherited from ancestral repeats, whereas 71.25% of the lineage-specific TEs emerged during genome expansion (Fig. S12).
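LTR insertion dates such as the ~6.41 MYA peak are commonly estimated from the sequence divergence between the two LTRs of a single element, which are identical at the time of insertion. A minimal sketch of that widely used approach follows, assuming a Jukes-Cantor correction and T = K / (2μ). The procedure and the substitution rate used in this study are not stated in this section, so the rate below (1.3e-8 per site per year) is only a placeholder assumption, and the function names are hypothetical.

```python
import math


def jc69_distance(p):
    """Jukes-Cantor corrected distance from the observed proportion of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the JC69 correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def insertion_age_mya(p, rate_per_site_per_year):
    """Estimate insertion age (MYA) from divergence between the 5' and 3' LTRs."""
    return jc69_distance(p) / (2.0 * rate_per_site_per_year) / 1e6


# Placeholder rate (an assumption, not a value taken from the article).
ASSUMED_RATE = 1.3e-8

# Hypothetical element whose two LTRs differ at 10% of aligned sites.
print(round(insertion_age_mya(0.10, ASSUMED_RATE), 2))
```

Because the age scales inversely with the assumed rate, absolute dates from this calculation are only as reliable as the rate chosen; the relative ordering of insertion peaks is more robust.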

Genomic basis of the fruit quality of Z. bungeanum

The quality of Z. bungeanum fruit is determined by the numbing and tingling taste, fragrance, and appearance, corresponding to three major characteristic constituents: alkamides, terpenes, and anthocyanidins. Here, we investigated potential molecular mechanisms associated with Z. bungeanum fruit traits through a comprehensive comparative transcriptome analysis at different fruit development stages.

Insights into sanshool biosynthesis

Sanshools are synthesized from two direct precursor substrates, an unsaturated fatty acid moiety and propanamine^4,33, in a reaction catalyzed by a potential acetyltransferase (NAF) (Fig. 4A). We identified 24 sanshool-like compounds from the pericarp of Z. bungeanum, 13 of which were recently discovered (Fig. S13, Table S15)^34,35, and we found that the fatty acid moiety is often a C12 or C14 unsaturated fatty acyl-CoA (Fig. S13), which is biosynthesized by acyl-ACP thioesterase and fatty acid desaturase (Fig. 4A). The amines are biosynthesized in two steps: a valine decarboxylation reaction through branched chain amino acid (BCAA) decarboxylase and a hydroxylation reaction to produce 2-hydroxy-2-methylpropanamine through a cytochrome P450 hydroxylase (Fig. 4A).

**Fig. 4: The metabolic pathways and protein families for sanshool biosynthesis.**

Gene family expansion may be involved in the biosynthesis of a large number of different sanshools. We found seven expanded gene families involved in the biosynthesis of unsaturated fatty acyl-CoA, which included acetyl-CoA carboxylase carboxyl transferase subunit alpha (AccA, 16 genes), 3-oxoacyl-[acyl-carrier-protein] synthase II (FabF, 21 genes), fatty acyl-ACP thioesterase B (FATB, 22 genes), soluble fatty acid desaturase (FAB2, 14 genes), two classes of membrane-bound fatty acid desaturases (FADs, 24 genes), and long chain acyl-CoA synthetase (ACSL, 28 genes) (Fig. 4B). The abundant acyl-ACP thioesterase B in Z. bungeanum, which is approximately 13-fold higher than that of C. sinensis, could provide a variety of fatty acid precursors for sanshool biosynthesis (Fig. S14, Table S16).

Regarding the biosynthesis of propanamine, three of the four gene families involved in valine biosynthesis were significantly expanded; however, the mechanism of action of valine decarboxylase and propanamine hydroxylase in Z. bungeanum is still unclear. We analyzed all possible amino acid decarboxylases and found that 6 out of 17 gene families identified in Z. bungeanum were significantly expanded. Among them, a gene family annotated as group II pyridoxal-dependent decarboxylase is the ortholog of the verified VDC in Echinacea purpurea³³. We analyzed two kinds of N-acetyltransferases, BAHD acyltransferase and Gcn5-related N-acetyltransferase, and found that there were 11 gene families of BAHD acyltransferases (158 genes) that were significantly expanded in Z. bungeanum compared to only two expanded gene families of Gcn5-related N-acetyltransferase (12 genes). This result implied that the potential N-acetyltransferase for sanshool biosynthesis may be a BAHD acyltransferase (Fig. S15), similar to capsaicin synthase in Capsicum annuum³⁶.

The abundance of sanshools gradually increased in the pericarp during postanthesis³⁷. As expected, the level of the typical hydroxy-β-sanshool also increased with fruit development (Fig. 4C, top histogram). To examine the correlation between the gene expression and abundance of sanshools, we constructed a coexpression network using RNA-Seq data from seven fruit development stages. Gene expression profiles for 2,752 metabolic genes were clustered into five modules (Fig. 4C, left heatmap, Fig. S16, Table S17). Furthermore, KEGG metabolic pathway enrichment analysis was performed for each module (Fig. 4C, right panel, Table S18). Both the fatty acid pathway and branched chain amino acid pathway were observed to be involved in the biosynthesis of sanshools. We observed that there was an enrichment of saturated and unsaturated fatty acid biosynthesis in module 4, with an increase in gene expression in the early stages but a reduction in the later stages. These results demonstrated that fatty acids were biosynthesized mainly during the intermediate stage of pericarp development. We also found that valine, leucine, and isoleucine biosynthesis was significantly enriched in module 3, in which gene expression increased throughout pericarp development (Fig. 4C). The reinforced biosynthesis of branched-chain amino acids can afford amine precursors for the synthesis of sanshools.

We further examined the gene expression profile involved in sanshool biosynthesis and their orthologous genes in Citrus, which does not produce a tingling sensation. We found 23,603 orthologous pairs between Z. bungeanum and C. sinensis, of which 2,874 pairs showed significantly higher expression levels in Z. bungeanum pericarps than in C. sinensis. Among these, 38 of 193 pairs related to the sanshool biosynthesis pathway showed significantly higher expression levels in the pericarp of Z. bungeanum (Fig. 4D, Fig. S17), and the proportion was significantly higher than that in the background (P = 0.002). The enrichment of highly expressed genes involved in sanshool biosynthesis not only indicates the underlying genetic basis for the accumulation of sanshools in Z. bungeanum but also provides a potential gene set for the identification of undetermined steps in its biosynthesis pathway.
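The enrichment reported above can be checked approximately from the counts given in the text: 38 of 193 sanshool-related pairs (about 19.7%) versus 2,874 of 23,603 pairs overall (about 12.2%). The article does not state which statistical test produced P = 0.002, so the sketch below assumes a one-sided hypergeometric test as one plausible choice; the function name is hypothetical and the result should be read as a consistency check rather than a reproduction of the authors' analysis.

```python
from math import comb


def hypergeom_upper_tail(population, successes, draws, observed):
    """P(X >= observed) when `draws` items are sampled without replacement
    from `population` items, of which `successes` carry the property."""
    total = comb(population, draws)
    upper = min(successes, draws)
    tail = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    # Exact integers are summed first; a single division gives the probability.
    return tail / total


# Counts taken from the text.
ALL_PAIRS = 23603          # orthologous pairs between the two species
HIGHER_IN_ZB = 2874        # pairs more highly expressed in Z. bungeanum
PATHWAY_PAIRS = 193        # pairs related to sanshool biosynthesis
PATHWAY_HIGHER = 38        # of those, more highly expressed in Z. bungeanum

p_value = hypergeom_upper_tail(ALL_PAIRS, HIGHER_IN_ZB, PATHWAY_PAIRS, PATHWAY_HIGHER)
print(f"pathway fraction:    {PATHWAY_HIGHER / PATHWAY_PAIRS:.3f}")
print(f"background fraction: {HIGHER_IN_ZB / ALL_PAIRS:.3f}")
print(f"one-sided hypergeometric P: {p_value:.4f}")
```

Summing exact binomial coefficients avoids floating-point underflow for populations of this size and keeps the example free of third-party dependencies.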

Characteristics of anthocyanidin synthase (ANS) in Z. bungeanum

The Z. bungeanum cultivar ‘DaHongPao’ is renowned for its characteristic bright red pericarp during fruit maturation. Previous studies have suggested that flavonoids, such as anthocyanins, might be involved in the production of red pigments³⁸. A single copy of anthocyanidin synthase (ANS), which catalyzes the key step in anthocyanin biosynthesis, was retained in both the Arabidopsis and C. sinensis genomes, whereas it was expanded to five copies in the Z. bungeanum genome (Fig. 5A). The expression levels of the five ANS genes increased continuously during the later stages of fruit development (Fig. 5B). In particular, the expression of EVM0019607.1 was dramatically increased in the last stage of pericarp development and was approximately 28-fold higher than the average expression of all genes. However, the unique ANS in C. sinensis was not expressed during pericarp development (Fig. 5B). The key positive regulator of anthocyanin biosynthesis Ruby1³⁹, which encodes a MYB transcription factor, showed strongly divergent expression in Z. bungeanum and C. sinensis. Its orthologous genes in Z. bungeanum (EVM0033809.1 and EVM0052497.1) showed increased expression at the later stages of pericarp development (Fig. S18) compared with that of C. sinensis. Therefore, our integrative genomic and transcriptomic analyses suggested that changes in the gene expression and expansion of anthocyanidin synthase have shaped anthocyanin biosynthesis, resulting in the bright-red appearance of the pericarp during fruit maturation. In addition to anthocyanidin synthase, we also found that approximately 80% of the genes involved in flavonoid biosynthesis showed significantly higher expression in the pericarp of Z. bungeanum than in that of C. sinensis (Fig. 5C).

**Fig. 5: Evolutionary analysis and differentially expressed genes involved in the anthocyanidin synthase (ANS) pathway at seven fruit developmental stages in *Z. bungeanum*.**

Characteristics of terpene synthases (TPSs) in Z. bungeanum

Volatile oils, such as monoterpenes and sesquiterpenes, contribute to the characteristic aromas of Zanthoxylum^3,40 and Citrus⁴¹ in Rutaceae. Most terpenes are produced by terpene synthases (TPSs). A total of 70 TPS genes, assigned to eight gene families, were identified in the Z. bungeanum genome (Fig. S19, Table S19). The families TPS_0001 (producing monoterpenes, 31 genes) and TPS_0011 (producing sesquiterpenes, 23 genes) were significantly expanded in Z. bungeanum and C. sinensis compared to Arabidopsis (Fig. 6A, Tables S20 and S21). Furthermore, expression profile analysis of these 70 TPS genes showed that the expression levels of monoterpenoid synthases (TPS_0001) and sesquiterpenoid synthases (TPS_0011) were obviously higher than those of the other TPSs (Fig. 6B, Table S20). In particular, the gene EVM0049874.1, which was identified as beta-phellandrene synthase in Japanese pepper (Z. piperitum)⁴², had the highest expression level among all TPSs (Fig. 6B). This result is consistent with the fact that beta-phellandrene is the major accumulated product in the secretory cavities of the leaf and pericarp. Additionally, a previous study reported that the gene expression of beta-phellandrene synthase was detected only during the early stages of cavity development, while the formation of volatile terpenes occurred at a constant rate throughout the expansion of secretory cavities⁴². Similarly, our study indicated that the expression level of the beta-phellandrene synthase gene was dramatically decreased at the fruit maturation stage (Fig. 6B). A similar pattern was also observed for the monoterpenoid synthases of C. sinensis (Fig. S20), which are mainly used to produce D-limonene in the pericarp. Previous studies have indicated that the down-regulation of D-limonene synthase in orange fruit can induce resistance against fungal diseases^43,44.

**Fig. 6: Evolutionary analysis and differentially expressed genes involved in the terpene synthase (TPS) pathway at seven fruit developmental stages in *Z. bungeanum*.**

Discussion

Although several species of Zanthoxylum have a long history of cultivation and application in traditional Chinese medicine and are also popular as food additives, scientific research has been hampered by the absence of genetic resources. Here, we present a genome assembly for Z. bungeanum, which has a larger genome size than most sequenced plants. Assembling this genome was highly challenging due to its high heterozygosity (2.87%), striking TE expansion (~89%), and large number of chromosomes (68); nevertheless, our assembly covers 95.5% of the Z. bungeanum genome (~4.23 Gb). This assembled Zanthoxylum reference genome will reveal novel evolutionary events that have not been uncovered in related plant taxa until now.

The genome size was larger and there were more genes and chromosomes in Z. bungeanum than in most sequenced dicots. Phylogenetic analysis indicated that Z. bungeanum and C. sinensis probably diverged approximately 35.3 MYA, which is consistent with the divergence time, 36.5 to 37.7 MYA, estimated by nuclear and chloroplast genes, respectively⁴⁵. It has been well documented that extensive amplification of TEs and WGD events have resulted in significant genome expansion in plants^46,47,48. The Z. bungeanum genome underwent a recent lineage-specific η-WGD event at approximately 26.8 MYA, which distinguished the genome of Zanthoxylum from its close relative Citrus. Following WGD, the return to a genetically diploid state was associated with numerous chromosomal fissions and fusions, finally resulting in 68 structurally diverse chromosomes. However, the rediploidization processes could not conceal the WGD event, and a dosage of duplicated genes was retained. Therefore, this lineage-specific η-WGD event may have been involved in driving genome expansion, the proliferation of TEs, and chromosomal rearrangement.

Similar to garlic, whose percentage of repetitive elements (91.3%) is the highest among all sequenced plant genomes⁴⁹, more than 89% (∼3.8 Gb) of the Z. bungeanum genome assembly is composed of different transposable elements (TEs), which is slightly higher than the TEs (∼81% of 3.3 Gb genome size) in hot pepper⁵⁰. Clearly, rapid amplification of retrotransposons contributed much more to the genome expansion in Z. bungeanum (72.04%) than that in C. sinensis (18%) (Fig. 3B) but paralleled the genomic topology of the maize genome (75%)⁴⁸. In contrast to other dicots^51,52, Copia elements constituted the predominant component of LTR elements. This scenario is quite different from that of most sequenced dicots, such as hot pepper, in which Gypsy elements are the predominant components of LTR retrotransposons^36,50, but is similar to that of C. sinensis²¹. Most TEs were greatly expanded in Z. bungeanum after the speciation event, and this species-specific process led to the large extant genome size of Z. bungeanum. On the other hand, active TEs might have triggered the occurrence of fission and fusion events in Zanthoxylum chromosomes⁵³. Comparative analysis of Copia and Gypsy elements between Z. bungeanum and C. sinensis showed that the LTRs in the former were young and accumulated separately from those of C. sinensis, implying that active transposition of LTRs in Zanthoxylum occurred specifically after its split from Citrus species and that their expansions were also responsible for Z. bungeanum genome expansion. Overall, these results showed that a recent η-WGD event occurred, followed by a more recent burst of TE insertions. Therefore, the Z. bungeanum-specific WGD event combined with recent TE bursts contributed to the extraordinarily large genome size and the evolution of unique Zanthoxylum traits. 
In addition, frequent fusion/fission events have also destroyed the ancestral genome state, broken the chromosomes, and finally yielded a large number of reconnected chromosomes (Fig. 2C).

Several studies have confirmed that gene expansion can deeply reshape the breadth and abundance of secondary metabolites in plants^50,51. Evolution of the capsaicinoid biosynthetic pathway in hot pepper involved multiple rounds of unequal duplication of key genes (i.e., capsaicin synthase) along with changes in their expression after speciation³⁶, and this pattern also holds true in Z. bungeanum. In this study, we identified lineage-specific genes that likely control the quality of Z. bungeanum, in particular, genes encoding enzymes relevant to sanshool, anthocyanin, and beta-phellandrene biosynthesis. Our comparative analyses indicated an obvious expansion of genes encoding acyl-ACP thioesterase, NAF, ANS, and TPS, which tend to be coexpressed during fruit development. Therefore, gene expansion and subsequent neofunctionalization in the Zanthoxylum genome may be a major driving force for its peculiar biological characteristics. Additionally, by integrating genomic and transcriptomic analyses, we clarified the evolutionary processes of many enzymes involved in the biosynthetic pathways of specific secondary metabolites in Z. bungeanum, which are the factors determining the quality of Z. bungeanum.

The Z. bungeanum reference genome reported here offers unprecedented insights into the genome dynamics of the spice crop and will continue to provide a strong foundation for further studies not only on Z. bungeanum but also on other Rutaceae species. A combination of comparative genomics, metabolic engineering, and transgenic approaches will help reveal the molecular mechanisms of secondary metabolites, thereby expediting the processes of crop improvement in the future.

Experimental procedures

PacBio sequencing

An improved CTAB method was used to extract genomic DNA. Genomic DNA was sheared into 20 kb fragments using a g-TUBE device (Covaris Inc., Woburn, MO, USA). The sheared DNA was purified and concentrated using Agencourt AMPure XP beads (Beckman Coulter Inc., Pasadena, CA, USA) and then used for SMRTbell (single-molecule real-time) library preparation according to the manufacturer’s protocol (Pacific Biosciences, Menlo Park, CA, USA; 20 kb template preparation kit), with size selection following the BluePippin protocol (Sage Science, Beverly, MA, USA). After size selection, the isolated SMRTbell fractions were purified using AMPure XP beads and used for primer (V3) and polymerase (2.0) binding according to the manufacturer’s binding calculator (Pacific Biosciences). Single-molecule sequencing was performed on a PacBio Sequel system, and only subreads of 500 bp or longer were used for subsequent genome assembly.

Illumina sequencing

We constructed seven libraries with 270 bp insert fragments for Z. bungeanum following Illumina’s protocol (Illumina, San Diego, CA, USA). Sequencing adapters and contaminating reads (mitochondrial, bacterial, and viral sequences) were removed from the raw Illumina reads by alignment to the NCBI-NR database using BWA v0.7.13⁵⁴ with default parameters. FastUniq v1.1⁵⁵ was used to remove duplicated read pairs, and reads were filtered out as low quality if they met any of the following conditions: (1) reads with ≥10% unidentified nucleotides (N), (2) reads with >10 nucleotides aligned to the adapter, allowing ≤10% mismatches, and (3) reads with >50% bases having a Phred quality <5.
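
Filtering conditions (1) and (3) can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the function name, parameter names, and the Phred+33 quality encoding are assumptions made for illustration, and condition (2), adapter matching, is omitted because it requires an aligner.

```python
def is_low_quality(seq: str, qual: str,
                   max_n_frac: float = 0.10,
                   min_phred: int = 5,
                   max_lowq_frac: float = 0.50,
                   phred_offset: int = 33) -> bool:
    """Return True if a read fails condition (1) or (3).

    Assumes Phred+33 quality strings (an assumption, not stated in the text).
    Condition (2), adapter matching, is not modelled here.
    """
    if not seq or len(seq) != len(qual):
        raise ValueError("sequence and quality must be non-empty and equally long")
    n = len(seq)
    # (1) reads with >= 10% unidentified nucleotides (N)
    if seq.upper().count("N") / n >= max_n_frac:
        return True
    # (3) reads with > 50% of bases having Phred quality < 5
    low = sum(1 for c in qual if ord(c) - phred_offset < min_phred)
    return low / n > max_lowq_frac


# Hypothetical reads: 'I' encodes Phred 40 and '!' encodes Phred 0 under Phred+33.
print(is_low_quality("ACGTACGTAC", "IIIIIIIIII"))
print(is_low_quality("NCGTACGTAC", "IIIIIIIIII"))
print(is_low_quality("ACGTACGTAC", "!!!!!!IIII"))
```

The thresholds are keyword arguments so that the same check can be reused with other cutoffs.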

Hi-C sequencing

According to the Hi-C procedure, nuclear DNA from the leaves of Z. bungeanum was cross-linked and then cut with the restriction enzyme Dpn II, leaving pairs of distally located but physically interacting DNA molecules attached to one another. The sticky ends of these digested fragments were biotinylated and then ligated to each other to form chimeric circles. Biotinylated circles, which are chimeras of the physically associated DNA molecules from the original cross-linking, were enriched, sheared, and sequenced using the Illumina HiSeq X Ten platform with 150 bp paired-end reads. As a result, we obtained a total of 486.7 Gb clean Illumina reads.

Genome assembly

The full PacBio long reads were converted to fasta format. First, we used NextDenovo (v2.3) (https://github.com/Nextomics/NextDenovo) to generate a draft genome assembly with default parameters for PacBio reads only. We then used NextPolish (v2.0)²⁴ to polish the draft genome with both long and short reads to obtain the corrected genome. This was followed by processing using purge_dups to purge the haplotigs and error-containing fragments. Subsequently, contigs were clustered with hierarchical clustering of the Hi-C data. To anchor scaffolds onto chromosomes, the Hi-C sequencing data were aligned to the assembly by BWA (aln mode) using the default parameters⁵⁴, and valid contacts were detected. In total, 224,908,615 valid interaction read pairs were used for Hi-C scaffolding. Based on the valid Hi-C interaction read pairs, 16,615 contigs were clustered into 68 pseudochromosomes using ALLHiC^27,28, of which 16,611 contigs with a total length of 4,124,904,629 bp were ordered and oriented within each group. The gap percentage in the final assembly was only 0.04%.

Genome quality assessment

The completeness of the assembly was checked by mapping 2,270 benchmarking universal single-copy orthologs (BUSCOs) and 458 core eukaryotic genes (CEGs) to the genome assembly using BUSCO v3.0.2b⁵⁶ and CEGMA v2.5⁵⁷, respectively. Additionally, we used the LTR assembly index (LAI)⁵⁸ to evaluate the completeness of the assembly.

Repeated sequence prediction

The repeat components in Z. bungeanum assembly were first estimated by building a de novo repeat library by employing the programs LTR-FINDER⁵⁹, MITE-Hunter⁶⁰, RepeatScout v1.0.5⁶¹, and PILER-DF⁶², and the output results were merged together and classified using PASTEClassifier v1.0⁶³. This de novo constructed database together with the Repbase database v20.01⁶⁴ were used to create the final repeat library. Repeat sequences in Z. bungeanum were identified and classified using the RepeatMasker program v4.0.6⁶⁵. The LTR family classification criterion was defined based on 5′ LTR sequences of the same family sharing at least 80% identity over at least 80% of their length. The expansion history of transposons was estimated by computing the divergence of the transposon Copia from the corresponding consensus sequence in the repeat library according to the RepeatMasker output and then calculating the percentage of transposons at different divergence levels.

LTR-RT analysis

Long terminal repeat retrotransposons (LTR-RTs) were identified using LTR_retriever. We identified a total of 53,470 intact LTR-RTs (listed in the output file named “.pass.list”). We then extracted the internal regions of all intact LTR-RTs and conducted BLASTX searches against the nonredundant LTR-RT library (.LTRlib.fa). Based on the best hits, the internal regions of all intact LTR-RTs mapped to up to 3,300 LTR-RTs in the nonredundant LTR-RT library.

Protein-coding gene prediction

We used de novo, protein homology, and RNA-Seq approaches for protein-coding gene prediction. In detail, Genscan v1.0⁶⁶, Augustus v2.5.5⁶⁷, GlimmerHMM v3.0.1⁶⁸, GeneID v1.3, and SNAP⁶⁹ were used to perform de novo gene prediction; homologous peptides from Arabidopsis thaliana (The Arabidopsis Information Resource), Oryza sativa (Phytozome v12.1), and Citrus reticulata (http://citrus.hzau.edu.cn/orange/index.php) were aligned to our assemblies to identify homologous genes with GeMoMa v1.4.2⁷⁰; the RNA-Seq reads were assembled de novo into unigenes using Trinity; and the resulting unigenes were aligned to the repeat-masked assemblies using BLAT⁷¹. Subsequently, the gene structures of the BLAT alignment results were modeled using PASA⁷², and the protein-coding regions were identified using TransDecoder v3.0.1 (https://github.com/TransDecoder/TransDecoder/) and GeneMarkS-T⁷³. Finally, consensus gene models were generated by integrating de novo predictions, protein alignments, and transcript data using EVidenceModeler⁷⁴. Annotation of the predicted genes was performed by BLAST searches against a series of nucleotide and protein sequence databases, including KOG⁷⁵, KEGG⁷⁶, NCBI-NR, and TrEMBL⁷⁷, with an E-value cutoff of 1e-5. Gene Ontology (GO) terms for each gene were assigned by Blast2GO⁷⁸ against the NCBI database.

Noncoding RNA prediction

Noncoding RNAs play important roles in a variety of processes and include ribosomal RNAs (rRNAs), transfer RNAs (tRNAs), and microRNAs (miRNAs). The rRNA fragments were identified by aligning the rRNA template sequences against the Pfam database v32.0⁷⁹ using BLAST with an E-value of 1e-10 and an identity cutoff of 95% or more. The tRNAscan-SE algorithm⁸⁰ with default parameters was applied to predict tRNA genes. The miRNA genes were predicted using INFERNAL v1.1⁸¹ against the Rfam database v14.0⁸² with a cutoff score of 30 or more. The minimum cutoff score was based on the settings that yielded a false-positive rate of 30 bits.

Comparative genomics analyses

Protein sequences of Z. bungeanum, Citrus sinensis, Arabidopsis thaliana, Amborella trichopoda, Piper nigrum, Zea mays, Oryza sativa, Papaver somniferum, Vitis vinifera, Dimocarpus longan, Brassica napus, Gossypium hirsutum, Arachis hypogaea, Cucumis sativus, Sesamum indicum, Capsicum annuum, and Nicotiana tabacum were used for all BLASTP analyses. The results were analyzed using OrthoMCL software⁸³ with an MCL inflation of 1.5 to identify gene family clusters. Single-copy gene clusters shared by all 17 species were used to construct a phylogenetic tree using PhyML v3.0⁸⁴. The divergence time was estimated using the MCMCtree implemented in PAML package v4.9⁸⁵. Calibration times were obtained from the TimeTree database (http://www.timetree.org/). Homologous blocks were detected using Mcscan v1.1⁸⁶. The K_s values of the blocks were calculated using the HKY model⁸⁷. According to the divergence time between Z. bungeanum and C. sinensis derived from the phylogenetic tree (Fig. 2A, 35.3 MYA), the synonymous substitution rate is 3.92 × 10⁻⁹ synonymous substitutions per site per year (T = K_s/(2λ), so λ = 0.277/(2 × 35.3 × 10⁶) = 3.92 × 10⁻⁹). The Zanthoxylum-specific WGD event date was obtained from the synonymous substitution (K_s) calculation with λ = 3.92 × 10⁻⁹.
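
The rate calculation can be checked with a short worked example in Python. The function names are hypothetical. The K_s value printed for the WGD is back-calculated here from the reported date of 26.8 MYA purely for illustration; it is not a value reported in this study.

```python
def substitution_rate(ks: float, divergence_mya: float) -> float:
    """lambda = Ks / (2 * T), with T converted from millions of years to years."""
    return ks / (2.0 * divergence_mya * 1e6)


def age_mya(ks: float, rate: float) -> float:
    """T = Ks / (2 * lambda), returned in millions of years."""
    return ks / (2.0 * rate) / 1e6


# Calibration from the text: Ks = 0.277 at a divergence time of 35.3 MYA.
lam = substitution_rate(0.277, 35.3)
print(f"lambda = {lam:.2e} per site per year")  # lambda = 3.92e-09 per site per year

# Illustration only: the Ks that corresponds to the reported WGD date (26.8 MYA).
ks_wgd = 2.0 * lam * 26.8e6
print(f"implied Ks = {ks_wgd:.3f}")             # implied Ks = 0.210
print(f"age = {age_mya(ks_wgd, lam):.1f} MYA")  # age = 26.8 MYA
```

The factor of 2 reflects that substitutions accumulate independently along both diverging lineages (or both duplicated copies).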

Expansion and contraction of OrthoMCL-derived gene clusters was determined using CAFÉ v2.1 and was based on changes in gene family size in the inferred phylogenetic history. KEGG and GO annotations of the gene family were completed by aligning the genes to the KEGG database and NCBI nonredundant database using BLASTP with an E value of 1e-5. Blast2GO was used to obtain the associated GO terms. The enrichment score was defined as a hypergeometric test value.

Synteny analysis

The genome synteny between and within species was analyzed via all-against-all BLASTP searches of protein sequences (with an E-value cutoff of 1e-5). Collinear blocks containing at least 10 genes (-s 10) and a maximum of 25 gaps (genes) between two proximal orthologs within a block (-m 25) were identified using Mcscan v1.1⁸⁶. Synteny was searched for by comparing the Z. bungeanum genome with the genomes of C. sinensis and V. vinifera.

Karyotype evolution analysis of Rutaceae

We performed collinearity analysis for each species in the set comprising Z. bungeanum, Xanthoceras sorbifolia⁸⁸, C. sinensis²¹, Arabidopsis⁸⁹, and Vitis vinifera²⁹ using MCScanX⁸⁶, and the syntenic blocks were identified based on all-versus-all BLAST alignments included in the JCVI package⁹⁰ with default parameters. The distribution of the seven ancestral eudicot chromosomal lineages on each chromosome of each species was depicted using the syntenic blocks between the ancestral chromosomes of grape²⁹, as described in Bolot et al.⁹¹ and Murat et al.⁹², and those of the species examined. Speciation event dates were obtained from the synonymous substitution (K_s) calculation (divergence time = K_s/(2r)) with r = 6.5 × 10⁻⁹ (ref. ⁹³).

Coexpression analysis

Based on quality scores, the clean reads from the transcriptome data obtained from pericarps at seven developmental stages were trimmed using the quality trimming program Btrim⁹⁴ and aligned to the Z. bungeanum reference assembly using TopHat v2.21⁹⁵. Cufflinks v2.2.1⁹⁵ was used to assemble the mapped reads for each sample. We used the fragments per kilobase of exon model per million mapped reads (FPKM) as the normalized gene expression level. We constructed a coexpression network using the cluster function in MATLAB. First, 2,752 metabolic genes (average FPKM > 5) were selected based on the KEGG annotation. The standard of FPKM > 5 was selected because the expression profile of genes with low expression is susceptible to sequencing errors. Second, based on the Spearman correlation between genes, the 2,752 metabolic genes were clustered into five subnetwork modules (Fig. 4C and Fig. S16). KEGG enrichment analysis was conducted for each module to understand the relationship between the enriched pathways and gene expression patterns. The p values were calculated by a hypergeometric test and adjusted using the Benjamini–Hochberg procedure.
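
The expression filter and the Spearman correlation step can be sketched in plain Python. This is an illustration only: the study used the cluster function in MATLAB, the clustering step itself is not shown, the gene names and FPKM values are made up, and all function names are hypothetical.

```python
from statistics import mean


def rank(values):
    """Average ranks (1-based); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(rank(x), rank(y))


def expressed(profiles, min_mean_fpkm=5.0):
    """Keep genes whose average FPKM across samples exceeds the cutoff."""
    return {g: v for g, v in profiles.items() if mean(v) > min_mean_fpkm}


# Made-up FPKM values across seven developmental stages.
profiles = {
    "geneA": [6, 8, 12, 20, 35, 50, 80],
    "geneB": [10, 11, 15, 22, 30, 41, 60],
    "geneC": [90, 70, 40, 30, 12, 9, 7],
    "geneD": [0.1, 0.3, 0.2, 0.0, 0.4, 0.1, 0.2],
}
kept = expressed(profiles)
rho_ab = spearman(kept["geneA"], kept["geneB"])
rho_ac = spearman(kept["geneA"], kept["geneC"])
print(sorted(kept), round(rho_ab, 3), round(rho_ac, 3))
```

Because Spearman correlation uses ranks rather than raw values, two genes with the same monotonic trend correlate perfectly even when their expression magnitudes differ.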

Comparison of transcriptomes between Z. bungeanum and C. sinensis

Gene expression in C. sinensis was referenced from a previous study⁹⁶ in which transcriptome sequencing of the pericarps was performed at 90, 120, 150, 180, and 210 days after full bloom. To determine the gene expression differences between Z. bungeanum and C. sinensis pericarps, we first identified 14,675 orthologous genes between the two species. Then, so that the 14,675 orthologs had equal median expression levels in the two species, the gene expression values of C. sinensis were divided by 4.78.
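
Median-based scaling of this kind can be sketched as follows. The expression values are made up and were chosen so that the toy factor comes out near 4.78; the function names are hypothetical, and the sketch is not the code used in the study.

```python
from statistics import median


def median_scale_factor(reference, target):
    """Factor by which `target` must be divided so that its median matches `reference`."""
    ref_median = median(reference)
    if ref_median == 0:
        raise ValueError("reference median is zero; scaling factor is undefined")
    return median(target) / ref_median


def normalize(values, factor):
    """Divide every expression value by the scaling factor."""
    return [v / factor for v in values]


# Made-up expression values for five orthologs in each species.
zb = [2.0, 5.0, 10.0, 20.0, 40.0]    # stand-in for Z. bungeanum
cs = [8.0, 30.0, 47.8, 90.0, 200.0]  # stand-in for C. sinensis

factor = median_scale_factor(zb, cs)
cs_norm = normalize(cs, factor)
print(round(factor, 2), round(median(cs_norm), 2))
```

Matching medians, rather than means, keeps the scaling factor insensitive to a few very highly expressed genes.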

Analysis of gene family expansion

Protein families that were significantly larger in Z. bungeanum than in C. sinensis (p < 0.05) were considered to be expanded in Zanthoxylum. The KEGG annotations of C. sinensis and A. thaliana were downloaded from the KEGG website (https://www.genome.jp/kegg/). KEGG annotation of Z. bungeanum was performed using the KEGG Automatic Annotation Server (KAAS) platform. The acyl-ACP thioesterases and acetyltransferases in Z. bungeanum, C. sinensis, and A. thaliana were predicted using hmmsearch in conjunction with the thioesterase and acetyltransferase family HMM models PF01643 and PF02458, respectively (E-value < 1e-6), from Pfam^79,97. We then tested whether the number of genes in each of the two families was significantly higher in one plant than in another, relative to the background gene numbers of the two plants. The p values were calculated by a hypergeometric test and adjusted using the Benjamini–Hochberg procedure.
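
A hypergeometric test followed by Benjamini–Hochberg adjustment can be sketched in plain Python. The way the test is set up below (pooling the genes of two plants and asking whether one plant holds more family members than expected) is an assumption about the design, not a description of the authors' code, and the gene counts are made up.

```python
from math import comb


def hypergeom_sf(k, M, n, N):
    """P(X >= k) for X ~ Hypergeometric(population M, n successes, N draws)."""
    total = comb(M, N)
    upper = min(n, N)
    return sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1)) / total


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for steps_from_end, idx in enumerate(reversed(order)):
        rank = m - steps_from_end
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up counts: plant A has 500 genes, plant B has 300 (pooled M = 800).
# Each family is given as (members in plant A, members in both plants).
families = {"family1": (18, 20), "family2": (7, 11)}
M, N = 800, 500
raw = [hypergeom_sf(k, M, n, N) for k, n in families.values()]
adj = benjamini_hochberg(raw)
for name, p, q in zip(families, raw, adj):
    print(name, f"p={p:.3g}", f"adjusted={q:.3g}")
```

Exact integer binomial coefficients keep the tail probability free of rounding problems for counts of this size; for genome-scale counts, a statistics library would be the more practical choice.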

Annotation and analysis of terpene synthases

The TPSs in the Z. bungeanum genome were predicted using hmmsearch in conjunction with the terpene synthase family HMM model PF03936 (E-value < 1e-6) from Pfam^79,97. To analyze the evolution of the TPS gene family in Z. bungeanum, C. sinensis, and A. thaliana, the 158 (70 + 55 + 33) TPS proteins were further classified into 10 TPS families and 41 TPS subfamilies (Table S21) based on three criteria: (1) the proteins in a family or subfamily had relatively closer phylogenetic relationships in the phylogenetic tree constructed from the alignment of all TPS proteins; (2) the identity between two protein sequences in a family was higher than 45%; and (3) the identity between two protein sequences in a subfamily was higher than 60%.

Accession numbers

The Z. bungeanum genome, annotation, and raw data are deposited in NCBI under BioProject ID PRJNA524242 and accession number SKCR00000000.

References

Kubitzki, K., Kallunki, J., Duretto, M. & Wilson, P. Rutaceae. In The families and genera of vascular plants, flowering plants eudicots: sapindales, cucurbitales, myrtaceae (ed. Kubitzki K.) 276–356 (Springer Verlag, 2011).
Zhang, M. et al. Zanthoxylum bungeanum Maxim. (Rutaceae): a systematic review of its traditional uses, botany, phytochemistry, pharmacology, pharmacokinetics, and toxicology. Int. J. Mol. Sci. 18, 2172 (2017).
Yang, X. Aroma constituents and alkylamides of red and green Huajiao (Zanthoxylum bungeanum and Zanthoxylum schinifolium). J. Agr. Food Chem. 56, 1689–1696 (2008).
Boonen, J. et al. Alkamid database: chemistry, occurrence and functionality of plant N-alkylamides. J. Ethnopharmacol. 142, 563–590 (2012).
Greger, H. Alkamides: structural relationships, distribution and biological activity. Planta Med. 50, 366–375 (1984).
Yasuda, I., Takeya, K. & Itokawa, H. Distribution of unsaturated aliphatic acid amides in Japanese Zanthoxylum species. Phytochemistry 21, 1295–1298 (1982).
Matthias, B., Stark, T. D., Corinna, D., Sofie, L. S. & Thomas, H. All-trans-configuration in Zanthoxylum alkylamides swaps the tingling with a numbing sensation and diminishes salivation. J. Agric. Food Chem. 62, 2479–2488 (2014).
Xiong, Q., Dawen, S., Yamamoto, H. & Mizuno, M. Alkylamides from pericarps of Zanthoxylum bungeanum. Phytochemistry 46, 1123–1126 (1997).
Devkota, K. P. et al. Isobutylhydroxyamides from the pericarp of Nepalese Zanthoxylum armatum inhibit NF1-defective tumor cell line growth. J. Nat. Prod. 76, 59–63 (2013).
Rong, R. et al. Anesthetic constituents of Zanthoxylum bungeanum Maxim. pharmacokinetic study. J. Sep. Sci. 39, 2728–2735 (2016).
Tsunozaki, M. et al. A ‘toothache tree’ alkylamide inhibits Aδ mechanonociceptors to alleviate mechanical pain. J. Physiol. 591, 3325–3340 (2013).
Artaria, C., Maramaldi, G., Bonfigli, A., Rigano, L. & Appendino, G. Lifting properties of the alkamide fraction from the fruit husks of Zanthoxylum bungeanum. Int. J. Cosmet. Sci. 33, 328–333 (2011).
Yamazaki, E., Inagaki, M., Kurita, O. & Inoue, T. Antioxidant activity of Japanese pepper (Zanthoxylum piperitum DC.) fruit. Food Chem. 100, 171–177 (2007).
Li, K. et al. Zanthoxylum bungeanum essential oil induces apoptosis of HaCaT human keratinocytes. J. Ethnopharmacol. 186, 351–361 (2016).
Patiño, L., Prieto, R. & Cuca, S. Zanthoxylum genus as potential source of bioactive compounds. In Bioactive Compounds in Phytomedicine (ed. Rasooli I.) 185–218 (InTech, 2012).
Tang, M. et al. A novel drug candidate for alzheimer’s disease treatment: gx-50 derived from Zanthoxylum bungeanum. J. Alzheimers Dis. 34, 203–213 (2013).
Zhu, H., Huang, Y., Ji, X., Su, T. & Zhou, Z. Continuous existence of Zanthoxylum (Rutaceae) in Southwest China since the Miocene. Quatern. Int. 392, 224–232 (2016).
Chinese Pharmacopoeia Commission. Chinese Pharmacopoeia (in Chinese) Shanghai: Science and Technology Press of Shanghai. 275 (1977).
Chinese Pharmacopoeia Commission. Chinese Pharmacopoeia (in Chinese) Shanghai: Science and Technology Press of Shanghai. 149 (2010).
Chinese Pharmacopoeia Commission. Chinese Pharmacopoeia (in Chinese) Shanghai: Science and Technology Press of Shanghai. 159–160 (2015).
Xu, Q. et al. The draft genome of sweet orange (Citrus sinensis). Nat. Genet. 45, 59–66 (2013).
Wang, X. et al. Genomic analyses of primitive, wild and cultivated citrus provide insights into asexual reproduction. Nat. Genet. 49, 765–772 (2017).
Wu, G. A. et al. Genomics of the origin and evolution of Citrus. Nature 554, 1–20 (2018).
Hu, J., Fan, J., Sun, Z. & Liu, S. NextPolish: a fast and efficient genome polishing tool for long read assembly. Bioinformatics 36, 2253–2255 (2019).
Zeng, L. et al. Whole genomes and transcriptomes reveal adaptation and domestication of pistachio. Genome Biol. 20, 79 (2019).
Yan, L. et al. The genome of Dendrobium officinale illuminates the biology of the important traditional Chinese orchid herb. Mol. Plant 8, 922–934 (2015).
Zhang, J. et al. Allele-defined genome of the autopolyploid sugarcane Saccharum spontaneum L. Nat. Genet. 50, 1565–1573 (2018).
Zhang, X., Zhang, S., Zhao, Q., Ming, R. & Tang, H. Assembly of allele-aware, chromosomal-scale autopolyploid genomes based on Hi-C data. Nat. Plants 5, 833–845 (2019).
Jaillon, O. et al. The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature 449, 463–467 (2007).
Jiao, Y. et al. Ancestral polyploidy in seed plants and angiosperms. Nature 473, 97–100 (2011).
Sierro, N. et al. The tobacco genome sequence and its comparison with those of tomato and potato. Nat. Commun. 5, 3833–3833 (2014).
Bai, R. et al. Resequencing of 429 chickpea accessions from 45 countries provides insights into genome diversity, domestication and agronomic traits. Nat. Genet. 51, 857–864 (2019).
Rizhsky, L. et al. Integrating metabolomics and transcriptomics data to discover a biocatalyst that can generate the amine precursors for alkamide biosynthesis. Plant J. 88, 775–793 (2016).
Wang, Y. et al. Isobutylhydroxyamides from Zanthoxylum bungeanum and their suppression of NO production. Molecules 21, 1416 (2016).
Wang, Y. et al. Isolation, structural characterization and neurotrophic activity of alkylamides from Zanthoxylum bungeanum. Nat. Prod. Commun. 12, 1121–1124 (2017).
Kim, S. et al. Genome sequence of the hot pepper provides insights into the evolution of pungency in Capsicum species. Nat. Genet. 46, 270–278 (2014).
Sugai, E., Morimitsu, Y. & Kubota, K. Quantitative analysis of sanshool compounds in Japanese pepper (Zanthoxylum piperitum DC.) and their pungent characteristics. Biosci. Biotech. Bioch. 69, 1958–1962 (2005).
De Pascualteresa, S. & Sanchezballesta, M. T. Anthocyanins: from plant to health. Phytochem. Rev. 7, 281–299 (2008).
Huang, D. et al. Subfunctionalization of the Ruby2–Ruby1 gene cluster during the domestication of citrus. Nat. Plants 4, 930–941 (2018).
Gong, Y. et al. Chemical composition and antifungal activity of the fruit oil of Zanthoxylum bungeanum Maxim. (Rutaceae) from China. J. Essent. Oil Res. 21, 174–178 (2009).
Njoroge, S. M., Koaze, H., Karanja, P. N. & Sawamura, M. Volatile constituents of redblush grapefruit (Citrus paradisi) and pummelo (Citrus grandis) peel essential oils from Kenya. J. Agr. Food Chem. 53, 9790–9794 (2005).
Fujita, Y. et al. Biosynthesis of volatile terpenes that accumulate in the secretory cavities of young leaves of Japanese pepper (Zanthoxylum piperitum): isolation and functional characterization of monoterpene and sesquiterpene synthase genes. Plant Biotechnol. 34, 17–28 (2017).
Rodríguez, A. et al. Terpene down-regulation triggers defense responses in transgenic orange leading to resistance against fungal pathogens. Plant Physiol. 164, 321–339 (2014).
Rodríguez, A. et al. Engineering d-limonene synthase down-regulation in orange fruit induces resistance against the fungus Phyllosticta citricarpa through enhanced accumulation of monoterpene alcohols and activation of defence. Mol. Plant Pathol. 19, 2077–2093 (2018).
Feng, S. et al. De novo transcriptome assembly of Zanthoxylum bungeanum using Illumina sequencing for evolutionary analysis and simple sequence repeat marker development. Sci. Rep. 7, 16754 (2017).
Bennetzen, J. L. Mechanisms and rates of genome expansion and contraction in flowering plants. Genetica 115, 29–36 (2002).
De Peer, Y. V., Maere, S. & Meyer, A. The evolutionary significance of ancient genome duplications. Nat. Rev. Genet. 10, 725–732 (2009).
Schnable, P. S. et al. The B73 maize genome: complexity, diversity, and dynamics. Science 326, 1112–1115 (2009).
Sun, X. et al. A chromosome-level genome assembly of garlic (Allium sativum) provides insights into genome evolution and allicin biosynthesis. Mol. Plant 13, 1328–1339 (2020).
Qin, C. et al. Whole-genome sequencing of cultivated and wild peppers provides insights into Capsicum domestication and specialization. Proc. Natl Acad. Sci. USA 111, 5135–5140 (2014).
Tang, C. et al. The rubber tree genome reveals new insights into rubber production and species adaptation. Nat. Plants 2, 16073 (2016).
Teh, B. T. et al. The draft genome of tropical fruit durian (Durio zibethinus). Nat. Genet. 49, 1633–1641 (2017).
Zhang, J., Yu, C., Krishnaswamy, L., Peterson, T. Transposable elements as catalysts for chromosome rearrangements. In Plant Chromosome Engineering. Methods in Molecular Biology (Methods and Protocols) (ed. Birchler J) 701 (Humana Press, Totowa, NJ, 2011).
Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows-Wheeler Transform. Bioinformatics 26, 589–595 (2009).
Xu, H. et al. FastUniq: a fast de novo duplicates removal tool for paired short reads. PLoS ONE 7, e52249 (2012).
Simao, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210–3212 (2015).
Parra, G., Bradnam, K. & Korf, I. CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics 23, 1061–1067 (2007).
Qu, S., Chen, J. & Jiang, N. Assessing genome assembly quality using the LTR Assembly Index (LAI). Nucleic Acids Res. 46, e126 (2018).
Xu, Z. & Wang, H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res. 35, 265–268 (2007).
Han, Y. & Wessler, S. R. MITE-Hunter: a program for discovering miniature inverted-repeat transposable elements from genomic sequences. Nucleic Acids Res. 38, e199 (2010).
Price, A. L., Jones, N. C. & Pevzner, P. A. De novo identification of repeat families in large genomes. Bioinformatics 21, i351–i358 (2005).
Edgar, R. C. & Myers, E. W. PILER: identification and classification of genomic repeats. Bioinformatics 21, 152–158 (2005).
Wicker, T. et al. A unified classification system for eukaryotic transposable elements. Nat. Rev. Genet. 8, 973–982 (2007).
Bao, W., Kojima, K. & Kohany, O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mob. DNA 6, 11 (2015).
Chen, N. Using repeatMasker to identify repetitive elements in genomic sequences. Curr. Protoc. Bioinformatics 25, 10.11–14.10.14 (2004).
Burge, C. & Karlin, S. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78–94 (1997).
Stanke, M. & Waack, S. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics 19, ii215–ii225 (2003).
Majoros, W. H., Pertea, M. & Salzberg, S. L. TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics 20, 2878–2879 (2004).
Blanco, E., Parra, G. & Guigó, R. Using geneid to identify genes. Curr. Protoc. Bioinformatics 4, 4–3 (2007).
Keilwagen, J. et al. Using intron position conservation for homology-based gene prediction. Nucleic Acids Res. 44, e89 (2016).
Kent, W. J. BLAT—the BLAST-like alignment tool. Genome Res. 12, 656–664 (2002).
Haas, B. J. et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 31, 5654–5666 (2003).
Tang, S., Lomsadze, A. & Borodovsky, M. Identification of protein coding regions in RNA transcripts. Nucleic Acids Res. 43, e78 (2015).
Haas, B. J. et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biol. 9, 1–22 (2008).
Tatusov, R. L. et al. The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4, 41–41 (2003).
Kanehisa, M. & Goto, S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28, 27–30 (1999).
Boeckmann, B. et al. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 31, 365–370 (2003).
Conesa, A. et al. Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics 21, 3674–3676 (2005).
Finn, R. D. et al. Pfam: the protein families database. Nucleic Acids Res. 42, 222–230 (2014).
Lowe, T. M. & Eddy, S. R. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25, 955–964 (1997).
Nawrocki, E. P. & Eddy, S. R. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics 29, 2933–2935 (2013).
Griffiths-Jones, S., Bateman, A., Marshall, M., Khanna, A. & Eddy, S. R. Rfam: an RNA family database. Nucleic Acids Res. 31, 439–441 (2003).
Li, L., Stoeckert, C. J. & Roos, D. S. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 13, 2178–2189 (2003).
Guindon, S. et al. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst. Biol. 59, 307–321 (2010).
Yang, Z. PAML 4: phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. 24, 1586–1591 (2007).
Wang, Y. et al. MCScanX: a toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic Acids Res. 40, e49 (2012).
Hasegawa, M., Kishino, H. & Yano, T. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22, 160–174 (1985).
Liang, Q. et al. The genome assembly and annotation of yellowhorn (Xanthoceras sorbifolium Bunge). GigaScience 8, 1–15 (2019).
Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408, 796–815 (2000).
Tang, H. et al. Synteny and collinearity in plant genomes. Science 320, 486–488 (2008).
Bolot, S. et al. The ‘inner circle’ of the cereal genomes. Curr. Opin. Plant biol. 12, 119–125 (2009).
Murat, F., Armero, A., Pont, C., Klopp, C. & Salse, J. Reconstructing the genome of the most recent common ancestor of flowering plants. Nat. Genet. 49, 490–496 (2017).
Gaut, B. S., Morton, B. R., McCaig, B. C. & Clegg, M. T. Substitution rate comparisons between grasses and palms: synonymous rate differences at the nuclear gene Adh parallel rate differences at the plastid gene rbcL. Proc. Natl Acad. Sci. USA 93, 10274–10279 (1996).
Kong, Y. Btrim: a fast, lightweight adapter and quality trimming program for next-generation sequencing technologies. Genomics 98, 152–153 (2011).
Pollier, J., Rombauts, S. & Goossens, A. Analysis of RNA-Seq data with TopHat and Cufflinks for genome-wide expression analysis of Jasmonate-Treated plants and plant cultures. Methods Mol. Biol. 1011, 305–315 (2013).
Huang, H. et al. Global increase in DNA methylation during orange fruit development and ripening. Proc. Natl Acad. Sci. USA 116, 1430–1436 (2019).
Punta, M. et al. The Pfam protein families database. Nucleic Acids Res. 30, 276–280 (2000).

Acknowledgements

This research was financially supported by the National Key R&D Program of China (2018YFD1000605) and the Tianjin Science Fund for Distinguished Young Scholars (18JCJQJC48300).

Author information

These authors contributed equally: Shijing Feng, Zhenshan Liu, Jian Cheng, Zihe Li

Authors and Affiliations

College of Forestry, Northwest A&F University, Yangling, Shaanxi, China
Shijing Feng, Lu Tian, Tuxi Yang, Yulin Liu, Yonghong Liu & Anzhi Wei
Research Centre for Engineering and Technology of Zanthoxylum State Forestry Administration, Yangling, Shaanxi, China
Shijing Feng, Lu Tian, Tuxi Yang, Yulin Liu, Yonghong Liu & Anzhi Wei
College of Life Science, Northwest A&F University, Yangling, Shaanxi, China
Zhenshan Liu
Key Laboratory of Systems Microbial Biotechnology, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin, China
Jian Cheng & Huifeng Jiang
School of Ecology and Environment, Northwestern Polytechnical University, Xi’an, Shaanxi, China
Zihe Li
Biomarker Technologies Corporation, Beijing, China
Min Liu & He Dai
Center for Information in Biology, College of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
Zujun Yang
Center for Genomics and Biotechnology, Haixia Institute of Science and Technology, Fujian Provincial Key Laboratory of Haixia Applied Plant Systems Biology, College of Life Sciences, Fujian Agriculture and Forestry University, Fuzhou, China
Qing Zhang, Gang Wang & Jisen Zhang

Contributions

A.W. and H.J. designed the study. S.F., Z.L., L.T., T.Y., Y.L. and Y.L. prepared materials for genomic and RNA-Seq analysis. S.F. and Z.L. coordinated the project and supervised the data analysis. M.L. and H.Z. sequenced the Z. bungeanum genome and transcriptome. H.D., G.W. and Z.L. assembled, annotated, and analyzed the genomes. J.C. performed the RNA-Seq analysis. Q.Z. performed the karyotype evolution analysis. Z.Y. performed the chromosome karyotype experiment. S.F., H.J., Z.L., J.C. and J.Z. wrote the manuscript. All authors discussed the results and commented on the manuscript.

Corresponding authors

Correspondence to Jisen Zhang, Huifeng Jiang or Anzhi Wei.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Supplementary information

Supplemental tables for manuscript

Supplemental text and figures for manuscript

Data Set 1

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Feng, S., Liu, Z., Cheng, J. et al. Zanthoxylum-specific whole genome duplication and recent activity of transposable elements in the highly repetitive paleotetraploid Z. bungeanum genome. Hortic Res 8, 205 (2021). https://doi.org/10.1038/s41438-021-00665-1

Received: 15 June 2021
Revised: 29 July 2021
Accepted: 30 July 2021
Published: 03 September 2021
DOI: https://doi.org/10.1038/s41438-021-00665-1