De novo transcriptome assembly of Zanthoxylum bungeanum using Illumina sequencing for evolutionary analysis and simple sequence repeat marker development

Zanthoxylum, an ancient economic crop in Asia, has a satisfying aromatic taste and immense medicinal values. A lack of genomic information and genetic markers has limited the evolutionary analysis and genetic improvement of Zanthoxylum species and their close relatives. To better understand the evolution, domestication, and divergence of Zanthoxylum, we present a de novo transcriptome analysis of an elite cultivar of Z. bungeanum using Illumina sequencing; we then developed simple sequence repeat markers for identification of Zanthoxylum. In total, we predicted 45,057 unigenes and 22,212 protein coding sequences, approximately 90% of which showed significant similarities to known proteins in databases. Phylogenetic analysis indicated that Zanthoxylum is relatively recent and estimated to have diverged from Citrus ca. 36.5–37.7 million years ago. We also detected a whole-genome duplication event in Zanthoxylum that occurred 14 million years ago. We found no protein coding sequences that were significantly under positive selection by Ka/Ks. Simple sequence repeat analysis divided 31 Zanthoxylum cultivars and landraces into three major groups. This Zanthoxylum reference transcriptome provides crucial information for the evolutionary study of the Zanthoxylum genus and the Rutaceae family, and facilitates the establishment of more effective Zanthoxylum breeding programs.


Results
Sequencing and de novo assembly of Illumina paired-end reads. In this study, we generated sequences for the transcriptome of an elite Z. bungeanum cultivar, 'Fengxiandahongpao' . Approximately 56.2 million high-quality reads were generated, with a GC content of 43.38%. The combined sequences of high-quality reads were assembled into 65,337 individual transcripts and 45,057 unigenes, reaching a total length of 41.0 Mb and 27.5 Mb, respectively. The average length of the assembled transcripts was 627 bp, with an N50 of 874 bp. The length of the assembled unigenes ranged from 201 to 6,750 bp, with an average of 610 bp and an N50 of 846 bp (Table 1).

Sequence annotation.
For validation and annotation of the sequence assembly contigs and unique singletons, all unigenes were searched against six public protein databases, including the NCBI non-redundant protein (Nr) database, protein family (Pfam), Swiss-Prot, the Kyoto Encyclopedia of Genes and Genomes pathway   Figure S1). To further analyze the BLAST results, E-value and similarity distributions were calculated using the Nr database. The E-value distribution of the top hits revealed that 78.77% of the mapped sequences showed significant homology (E-value < 10 −30 ), and 21.22% of the homologous sequences had E-values in the range 10 −6 to 10 −30 (Supplement Figure S2A). Additionally, 91.18% and 76.17% of the sequences were found to have similarities >70% and 80%, respectively (Supplement Figure S2B). These results reflect the high identities of the mapped sequences, suggesting a good assembly quality.

Functional classification by GO analysis.
In total, 32,715 unigenes (72.61%) with BLAST matches to known proteins were assigned to GO classes with 132,647 functional terms (Fig. 2). The annotated unigenes that belonged to the biological process, cellular component, and molecular function clusters were categorized into 43 functional groups. Molecular function (47,706,35.96%) and biological process (46,861,35.33%) represented the largest number of unigenes, followed by cellular component (38,080, 28.71%).
Evolutionary pattern of protein coding sequences. The protein coding sequences (CDS) extracted from unigenes also provided comprehensive information for phylogenetic analysis. A phylogenetic analysis was conducted based on the 34 single-copy nuclear genes (Supplement Dataset S1) shared in Zanthoxylum, sweet orange (Citrus sinensis), poplar (Populus euphratica), grape (Vitis aestivalis), strawberry (Fragaria × ananassa), soybean (Glycine max), Arabidopsis, cacao (Theobroma cacao), peach (Prunus persica), and castor bean (Ricinus communis). Zanthoxylum was found to be phylogenetically closest to sweet orange, cacao, poplar, and castor bean (Fig. 5A). For comparison, a phylogenetic tree was derived from 68 chloroplast genes (Dataset S2) in the same species, which showed that Zanthoxylum was phylogenetically closest to sweet orange, cacao, Arabidopsis, and grape (Supplement Figure S3). This indicated a significantly different evolutionary pattern between or within nuclear and plastid genomes in some plant species (e.g., grape and Arabidopsis). Assuming orange diverged from cacao 85 million years ago (MYA) 17 , we confidently estimated a divergence time for Zanthoxylum and Citrus of 36.5 and 37.7 MYA from the nuclear and chloroplast trees, respectively ( Fig. 5A and S3).  Meanwhile, an age distribution was calculated by comparing the synonymous substitution values (Ks) of 2,750 orthologous gene pairs between the Zanthoxylum and Citrus genomes, 794 paralogous gene pairs in the Zanthoxylum transcriptome, and 1,054 paralogous gene pairs in Citrus genomes. A Zanthoxylum whole-genome duplication (WGD) event occurred 14 MYA (Ks peaks at 0.074), more recently than the divergence of Zanthoxylum and Citrus. Interestingly, we detected three peaks in the Ks distribution in orange, suggesting that at least three ancient WGD events occurred in orange (Fig. 5B). A second large Ks peak at a synonymous substitution rate of about 1.7 was observed in both Zanthoxylum and Citrus, indicating that an ancient WGD event occurred in their common ancestor ( Fig. 5B and Supplement Figure S4). We also inferred the direction and magnitude of natural selection acting on protein coding genes in Zanthoxylum by calculating the Ka/Ks through referring to Citrus orthologous genes. Only four genes with Ka/Ks > 1 but non-significant were observed, implying that most Zanthoxylum genes experienced purifying or stabilizing selection (Supplement Figure S5).
To better understand Zanthoxylum-specific biology, we identified Zanthoxylum-specific genes by comparative analyses of the Zanthoxylum protein CDS with 12 other plant genomes and rescreened them in the Nr protein database to exclude false-positive genes caused by incomplete genome annotation. In total, 22,212 CDS in the Zanthoxylum transcriptome were clustered into 8154 gene families, with 173 genes being specific to Zanthoxylum (Supplement Figure S6). One hundred of the Zanthoxylum-specific genes had unknown function. Of the 73 annotated genes, overrepresented molecular functions were oxidation-reduction, regulation of transcription, and translation processes. Exploring the function of these genes will contribute significantly to understanding of important Zanthoxylum-specific biological processes, such as the synthesis of numb-taste components.
Identification of single nucleotide polymorphisms. From the Z. bungeanum transcriptome sequencing data, we identified 170,000 single nucleotide polymorphisms (SNPs) in 37,658 unigenes. The proportions of transversions were: 10.48% A/C; 12.44% A/T; 7.04% C/G and 10.33% G/T. The proportions of transitions were: 29.61% A/G and 30.10% C/T. The percentage of transitions (59.71%) was 1.48-times higher than that of transversions (40.29%) in the Z. bungeanum transcriptome. Further analysis of the SNPs revealed that approximately 72% of them were distributed in CDS, and 20,461 CDS contained at least one SNP, leading to a rough estimate of polymorphism density of 6.16 SNPs per kb in sequences containing CDS. The large number of SNPs located in sequences containing CDS is possibly beneficial for future development of important economic and agronomic traits. These SNPs could be useful for molecular breeding programs and marker-assisted selection practices, as well as for understanding phenotypic differences between Zanthoxylum species.
Except for compound SSRs, A/T was the most abundant motif in the two possible types of mono-nucleotide repeat, contributing 16.31% of the total SSRs. The C/G motif was less abundant than A/T, accounting for 0.21% of the total SSRs. The dominant di-nucleotide repeat motif in SSRs was AG/CT accounting for 10.38% of the total SSRs. The AC/GT motif was slightly more abundant than the AT/AT motif, which constituted 1.74% and 1.00% of the total SSRs, respectively. Among the trinucleotide repeat motifs, the two most frequent repeats were GAA/ TTC (5.04% of the total SSRs) and AAG/CTT (4.54%), followed by ATC/GAT (3.30%), AGA/TCT (3.57%), and CAG/CTG (3.48%). Other motifs among quad-, penta-, and hexa-nucleotide repeats constituted 20.88% of the total SSRs (Fig. 6C).

Development and validation of SSR markers.
Out of 3,814 SSRs, a total of 1,000 primer pairs were successfully designed; 100 primer pairs were selected at random and tested to amplify the genomic DNA of Z. bungeanum (20 cultivars and two landraces) and Z. armatum (five cultivars and four landraces). Sixty primer pairs resulted in successful PCR amplification, 25 of which produced fragments of the expected size. Taking into account the specific amplification and polymorphic loci, we selected 16 primer pairs that presented clear and polymorphic loci to help evaluate polymorphism levels in the two Zanthoxylum species. After further testing, we selected 11 primer pairs for subsequent polymorphic analysis (Table 2) Table S2).
Neighbor-joining tree and STRUCTURE analyses clearly grouped the 31 cultivars and landraces into three major clusters (Fig. 7). Cluster I (cyan) included 13 cultivars of Z. bungeanum mainly distributed north of the Qinling Mountains. Cluster II (purple) included nine Z. bungeanum cultivars mainly distributed south of the Qinling Mountains. These two clusters are commonly known as "Red huajiao" for their bright red pericarp. Cluster III (orange) consisted of nine cultivars of Z. armatum, commonly known as "Green huajiao" for their bright green pericarp.

Discussion
Here we report the first transcriptome analysis of Zanthoxylum, an economic crop of great importance because of its flavor and industrial and medical applications 18 . The transcriptome sequence of Z. bungeanum provides a genomic resource for genetic study and breeding improvement of Zanthoxylum, and a reference for plant transcriptome-scale evolutionary analyses of the Rutaceae family.
Since Z. bungeanum has been bred for more than 4,000 years, a large number of elite cultivars have been widely dispersed, named, and released in China. Sequencing of the Z. bungeanum transcriptome provides genomic data for comprehensively assessing the heterozygosity of Zanthoxylum plants. Our transcriptome analysis indicated that Z. bungeanum might have a high degree of heterozygosity. Heterozygosity is a common feature in most plants and has important biological functions in crops. The heterozygosity of Z. bungeanum, estimated in the present study, is higher than that of sweet orange 17 , which suggests Z. bungeanum may have undergone an intensive inter-or intra-specific hybridization. Moreover, the high values of mean H o and H e also suggest a relatively high heterozygosity, while the negative value of F IS also indicates a significant heterozygote excess in Z. bungeanum. Therefore, it is not feasible to consider whole-genome sequencing of Zanthoxylum. Fortunately, the advent of RNA-Seq is useful for collecting genetic information from Z. bungeanum.   GO functional classification may help us to understand the distribution of gene functions at the macro level, and to predict the physiological role of each unigene. KOG is a database in which orthologous gene products are classified 19 . In the present study, the annotated unigenes were classified into 25 KOG and 43 GO subterms or subcategories, suggesting that our transcriptome data for Z. bungeanum represented a wide diversity of transcripts 20,21 . Meanwhile, the results revealed that the assembled unigenes had diverse molecular functions and were involved in many metabolic pathways. Under the biological process category following GO classification, cellular process and metabolic process were prominent, indicating the occurrence of important cellular and metabolic activities in Z. bungeanum.
KEGG pathways are helpful to better understand biological function and gene interaction. Functional classification based on KEGG pathways indicated that numerous important metabolic pathways in Z. bungeanum were unknown and worth investigation. Obviously, "metabolic pathways" represented the largest category, suggesting Z. bungeanum invests in cell maintenance and defense. Moreover, unigenes associated with metabolism of cofactors and vitamins, terpenoids and polyketides, and biosynthesis of other secondary metabolites were identified in the present study, with evidence that numerous biologically active secondary metabolites have been isolated in Zanthoxylum species 18 . 'Human diseases' represented another dominant pathway and contained six different subcategories, indicating a complex composition of the Zanthoxylum genome, which makes it an important target for genomic study.
Phylogenetic analysis based on nuclear and chloroplast genes indicated that Zanthoxylum is relatively recent and estimated to have diverged from Citrus 36.5-37.7 MYA. Interestingly, the nuclear phylogeny showed that Arabidopsis was not linked to Malvidae with full bootstrap value support, whereas the chloroplast tree topology demonstrated that Arabidopsis was included in the Malvidae. This contradiction may be due to the difference in evolutionary pattern between or within nuclear and chloroplast genes, and the sensitivity of the topology to the maximum likelihood methods.
The WGD is predominant in most flowering plant species during evolution 22 , as gene duplication has provided a necessary source of material for the origin of evolutionary novelties. Due to functional redundancy, genes are rapidly silenced or lost from the duplicated genomes 23 . The detection method of plant WGDs is derived from genomic structure (e.g., syntenic analysis), phylogenetic tree construction, and evolutionary distances (e.g., synonymous mutation rate, Ks) 24,25 . To investigate the genome evolutionary history of Zanthoxylum, we estimated the WGD time in the Zanthoxylum genome by calculating the Ks age distribution using the formula T = Ks/2r. Here we chose r = 2.6 × 10 −9 as the neutral substitution rate, because Zanthoxylum species are long-lived perennials and the generation time is negatively correlated with molecular evolutionary rates 26 . The synonymous mutation rates of several plants have been estimated using a small set of genes (2.6 × 10 −9 -1.5 × 10 −8 ) 27 . It is likely that the slower rates may be more appropriate for Zanthoxylum. As for the transformation of the Ks values to absolute ages, low levels of Ks produce a more accurate age estimation than high Ks values because of possible substitution-saturation effects and molecular rate heterogeneity among lineages. In the present study, we deduced a recent WDG event in Zanthoxylum 14 MYA. We also calculated the Ks of each pair of three-copy and four-copy genes in Zanthoxylum and orange. Strikingly, they presented a similar but flatter curve than that of two-copy gene pairs, providing support for our confident estimation of Ks distribution and a more ancient duplication for multicopy genes.
In the present study, trinucleotide repeats were the most abundant type of SSR motifs in the transcriptome of Z. bungeanum, followed by the di-type motifs. This finding is similar to the results reported for pummelo 28 and cotton 29 . Two hundred and six SSRs were present in the compound types, meaning they contained stretches of two or more different repeats. Among mononucleotide repeats, as in most plants, the A/T repeats were far more abundant than G/C repeats 30,31 . For the dinucleotide repeat category, the most abundant motif was AG/CT, which is similar to previous results in Hemarthria compressa "Yaan" and H. altissima "1110" 32 , Ipomoea batatas 33 , Brachypodium distachyon, and Oryza sativa 34 . Thirty-one motifs were found within the trinucleotide repeats; GAA/TTC and AAG/ CTT were the most abundant, similar to results for colored calla lily 35 . These results support previous observations that dinucleotide AG/CT motifs and trinucleotide AAG/CTT motifs are the most abundant SSR motifs in dicots 36 .
Mining transcriptome data is an efficient way for SSR marker development and offers great flexibility in selecting markers for different applications at different resolutions 37 . Moreover, SSRs derived from a transcriptome database have advantages over other strategies, such as low cost and short time. In the present study, we verified and obtained 11 variety-specific EST-SSR markers (Table 2), which provides a simple, informative, reproducible, and suitable approach for evaluation of genetic relationships in Zanthoxylum species. The neighbor-joining tree and STRUCTURE analyses divided 31 Zanthoxylum cultivars and landraces into three clear clusters. The availability of the Z. bungeanum transcriptome and deep exploitation of Zanthoxylum-specific genes with known or unknown biochemical functions will facilitate a better understanding of agronomically and economically important traits, including yield, numb-taste components, oil, and stress resistance in subsequent research.

Materials and Methods
Plant material. Z. bungeanum cultivar "Fengxiandahongpao" with excellent pericarp was used in this study. At the sprouting stage, we collected 40 stem tips (1 cm) from five 7-year-old individual plants of Z. bungeanum from the Research Centre for Engineering and Technology of Zanthoxylum, State Forestry Administration, Northwest A&F University (Fengxian, Shaanxi Province, China). All stem tips were immediately frozen in liquid nitrogen and stored at −80 °C to preserve RNA for transcriptome sequencing. Currently, approximately 30 cultivars of Z. bungeanum and nine cultivars of Z. armatum are planted in China. A total of 31 cultivars and landraces, including Z. bungeanum (20 cultivars and two landraces) and Z. armatum (five cultivers and four landraces, Supplement Table S1), were collected from the main cultivated regions and used for estimating the SSR markers and cluster analysis.

RNA isolation, integrity examination and RNA-Seq library preparation.
Total RNA was extracted from each group using TRIzol Regent (Invitrogen, Life Technologies, Carlsbad, CA, USA) according to the manufacturer's instructions. The extracted RNA was treated with a Total RNA Purification Kit (LC science, USA) for 30 min at 37 °C to remove any residual DNA. The quality and quantity of RNA were detected using a 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA).
After total RNA extraction, the mRNA in each group was purified and enriched using oligo (dT) magnetic beads. Following purification, a fragmentation buffer and elevated temperatures were used to break the mRNA into fragments of 200-300 bp using divalent cations. Subsequently, the cleaved RNA fragments were employed for first-strand cDNA synthesis using reverse transcriptase and random primers, and then the second-strand cDNA was synthesized using DNA polymerase I and RNaseH (Invitrogen). Double stranded cDNAs then underwent an end repair process with T4 DNA polymerase and Klenow DNA polymerase. These repaired cDNA fragments were then subjected to 3ʹ-single adenylation and ligated with sequencing adapters. These products were purified and amplified via PCR to construct a cDNA library for transcriptome sequencing. Finally, the cDNA library products were sequenced on a paired-end flow cell using an Illumina HiSeq ™ 2500 platform (Illumina Inc., San Diego, CA, USA).
Assembly and functional annotation. The raw reads of cDNA libraries were cleaned by a stringent filtering process. All adapter fragments as well as ambiguous reads containing >5% unknown nucleotides and low-quality reads with more than 20% Q < 10 bases were removed to ensure the accuracy of de novo assembly and subsequent analyses. The clean reads were assembled using Trinity software 10 with the parameters set at K-mer length of 25 and a similarity of 80%, with the other parameters set to their default values. The clean reads with a certain length of overlap were first joined to form longer fragments without N, known as contigs. Subsequently, the clean reads were mapped back to the contigs, and paired-end reads were used to calculate the distance and relation among these contigs. Trinity connected these contigs to obtain consensus sequences that could not be extended on either end, called transcripts. Finally, the transcripts were clustered into unigenes using TGI Clustering Tools 38 . Assembled sequences of less than 200 bp were discarded due to a low annotation rate.
All unigenes were subjected to BLASTx searches against the NCBI Nr database, the Swiss-Prot database, KOG, and KEGG with an E-value threshold of 10 −5 . According to the Nr annotation, the functional annotations of unigenes were obtained based on GO terms (GO annotation: http://www.geneontology.org) by employing Blast2GO 39 .
Evolutionary analysis of CDS. Genescan 40 was employed for gene prediction from unigenes. In total, 22,212 open reading frames were detected. The searching of single-copy homologous genes was performed by comparing the protein sequences of Zanthoxylum, sweet orange (Citrus sinensis), poplar (Populus euphratica), grape (Vitis aestivalis), strawberry (Fragaria × ananassa), soybean (Glycine max), Arabidopsis, cacao (Theobroma cacao), peach (Prunus persica), and castor bean (Ricinus communis), using all-to-all BLASTP analysis (E-value < 1E −5 ). Based on the species distribution of the top BLASTx hits against the Nr database, we selected sweet orange, poplar, grape, strawberry, soybean, Arabidopsis, cacao, peach, and castor bean for comparative phylogenetic analysis. Of these, Zanthoxylum, sweet orange, Arabidopsis, cacao, poplar, and castor bean belong to the Malvidae, a related phytogroup of Angiosperm Phylogeny Group classification. Rice (Oryza sativa) and sorghum (Sorghum bicolor) were used as out-groups. The OrthoMCL 41 method was adopted to construct gene families for the 12 plant species. The translated putative amino acid sequences of the 34 shared single-copy genes were aligned in MAFFT 42 and adjusted manually. Based on the concatenated sequences, phylogenetic analysis using the maximum-likelihood method was conducted in FastTree 43 with the JTT model plus gamma-distributed rates and 1,000 bootstrap replications. For comparison, we also used peptide sequences of 68 genes from chloroplast genomes from the same plant species and performed maximum-likelihood analysis with the same parameters.
To detect the signature of genome duplications, an OrthoMCL clustering program was run on the protein CDS from the Zanthoxylum and Citrus genomes. One-to-one orthologous genes were then extracted from the clustering results. Subsequently, we set up a stringent criterion of at least 300 nucleotide alignments and 70% identity to reliably define two sequences as orthologous. Gene pairs with a BLAST identity of ≥99.0% were defined as identical sequences because near identical sequences occasionally are present in clustered transcript data due to sequencing errors and the fragmentary nature of transcripts. All identical gene pairs that contained the shorter sequence were discarded. The codon based alignment of the nucleotide sequence pairs was used to calculate the Ks by KaKs_Calculator 44 . The duplication time of the genome was estimated using the formula T = Ks/2r, where 'r' is the neutral substitution rate. A neutral substitution rate of 2.6 × 10 −9 was used in this study 27 . The Ks of gene pairs of three-copy and four-copy genes families were also calculated.

Variant analysis.
Only reliable clean reads were considered for SNP calling. Indels were not considered because alternative splicing impedes reliable indel discovery 17 . SNPs were called using the SAMtools software package 45 . The assembled transcript sequences were used as templates to BLAST the clean reads. SSR locus detection and primer design. SSR locus screening was performed using the Perl script MIcroSAtellite (MISA, http://pgrc.ipk-gatersleben.de/misa/) based on the flanking sequences. SSR loci consist of repeated core sequences of two to six nucleotide motifs with minimum repeats of 6, 5, 4, 3 and 3, respectively. Mononucleotide repeats and complex SSRs were excluded from further analyses, because mononucleotide repeats may not be accurate due to sequencing errors and assembly mistakes, whereas complex SSRs show the least polymorphism. Primers could not be designed for those SSR loci whose flanking sequences were either too short or where the sequence did not meet appropriate criteria for primer design.
The SSR primer pairs were designed using the Oligo 7.0 program 46 based on the following parameters: 1) primer length of 17-25 bp with 20 bp as the optimum; 2) PCR product size ranging from 100 to 500 bp; 3) GC content of 40-70% with an optimum of 50%; 4) melting temperature (T m ) between 52 °C and 60 °C with 55 °C as the optimum annealing temperature, with a difference of no greater than 4 °C between the T m values of the forward and reverse primers; 5) primer pairs devoid of secondary structure or consecutive tracts of a single nucleotide; 6) the designed primer sequence was limited to the middle region, with 30 bp being removed from the ends of the contig sequence 47 . In total, 100 primers for SSR markers were synthesized. DNA isolation, PCR amplification and SSR validation. The total genomic DNA was extracted from dried-leaf tissue of 31 cultivars and landraces of Z. bungeanum and Z. armatum (Supplement Table S1) using a modified CTAB protocol 48 . All SSR primer pairs were tested for amplification and polymorphism using the mixed DNA 49 from five cultivars: 'Fengxiandahongpao' , 'Hanchengdahongpao' , 'Huangjinjiao' , 'Dingtanhuajiao' , and 'Qinghuajiao' (Supplement Table S1). In total, 100 pairs of primers were designed and validated by PCR and electrophoresis. Amplification by PCR was performed using an A200 Gradient Thermal cycler (LongGene, China), with reactions consisting of 10 µL of 2× Taq MasterMix (CWBIO Biotechnology, Beijing, China); 0.5 µL of each primer (100 µmol/L), 8 µL of double distilled water, and 1.0 µL of genomic DNA in a final reaction volume of 20 µL. The PCR conditions were: initial denaturation at 94 °C for 4 min, followed by 30 cycles of 1 min at 94 °C, 50 s at 52 to 60 °C (depending on the T m of the primer pair used), and 1 min at 72 °C; then a final extension for 7 min at 72 °C. The resulting PCR products were subjected to electrophoresis on 12% non-denaturing polyacrylamide gels at 250 V for 2-2.5 h and stained using a silver-staining protocol 50 . Cluster analysis was conducted and displayed using the neighbor-joining algorithm and STRUCTURE analysis as implemented in MEGA ver. 6.0 51 and STRUCTURE ver. 2.3.4 52 . The genetic diversity parameters (N a , N e , H o , H e , I, and F IS ) of selected SSR primer pairs were calculated using the POPGENE software 53 .
Data Availability. Transcriptome datasets supporting the conclusions of this article are available in the NCBI SRA repository under the accession number SRR5114371. The full dataset is also available from Shijing Feng upon request (shijing0554010112@163.com).