We produced complete sequences and conducted comparative analysis of the maternally inherited chloroplast (cp) genomes and bi-parentally inherited 45S nuclear ribosomal RNA genes (nrDNA) from ten Araliaceae species to elucidate the genetic diversity and evolution in that family. The cp genomes ranged from 155,993 bp to 156,730 bp with 97.1–99.6% similarity. Complete 45S nrDNA units were about 11 kb including a 5.8-kb 45S cistron. Among 79 cp protein-coding genes, 74 showed nucleotide variations among ten species, of which infA, rpl22, rps19 and ndhE genes showed the highest Ks values and atpF, atpE, ycf2 and rps15 genes showed the highest Ka/Ks values. Four genes, petN, psaJ, psbF, and psbN, related to photosynthesis and one gene, rpl23, related to the ribosomal large subunit remain conserved in all 10 Araliaceae species. Phylogenetic analysis revealed that the ten species could be resolved into two monophyletic lineages, the Panax-Aralia and the Eleutherococcus-Dendropanax groups, which diverged approximately 8.81–10.59 million years ago (MYA). The Panax genus divided into two groups, with diploid species including P. notoginseng, P. vietnamensis, and P. japonicus surviving in Southern Asia and a tetraploid group including P. ginseng and P. quinquefolius Northern Asia and North America 2.89–3.20 MYA.
The Araliaceae (also known as the ginseng family) and the Apiaceae are the major families in the order Apiales belonging to Asterid II1,2,3. The Araliaceae family comprises 55 genera and more than 1,500 plant species widely distributed in tropical, subtropical and temperate regions4, 5, many of which are used as oriental medicines, such as species in the genus Panax, Eleutherococcus and Aralia 6, 7. According to taxonomical studies, Araliaceae encompasses two large monophyletic groups: the Aralia-Panax group and the Asian Palmate group8. The Aralia-Panax group consists of the two closely-related genera, Aralia and Panax. Meanwhile, the Asian Palmate group is represented by the genera Eleutherococcus, Dendropanax, and Schefflera characterized as distinctive woody plants.
Although the conserved basic chromosome number was estimated to be x = 12 in Araliaceae family species based on diploid taxa (2n = 24), the chromosome numbers vary from 2n = 48 to 2n = 192 in polyploid species of the family9, 10. In the genus Panax, P. notoginseng, P. vietnamensis and P. japonicus are diploid with chromosome number of 2n = 24, while P. ginseng and P. quinquefolius are considered to be tetraploid with chromosome numbers of 2n = 48. The genus Aralia and Eleutherococcus are reported to have various chromosome numbers including 2n = 24, 36 or 48. (CCDB-chromosome count database; http://ccdb.tau.ac.il/).
Although many studies have reported taxonomical classification and divergence of Araliaceae species based on molecular data derived from a few chloroplast (cp) and nuclear sequences11,12,13,14,15,16, genetic diversity surveys and molecular phylogenetic classification of Panax and its relatives are still very limited. Cytoplasmic cp genomes and nuclear ribosomal DNA have widely been used to elucidate the evolution of plant species, owing to their characteristic highly conserved sequences, such that minor sequence divergences reflect evolutionary history17,18,19,20. Recently, we developed a de novo assembly method to obtain complete cp and nrDNA sequences using low-coverage whole-genome sequence (dnaLCW) and applied it to reveal the evolutionary history of various plant lineages such as Oryza AA genomes and Epimedium species21, 22, and also to identify intra-species diversity23.
In this study, we characterized cp genomes and 45S nrDNA sequences of ten Araliaceae species including Panax, Aralia, Eleutherococcus, and Dendropanax species and investigated genetic diversity among them to understand diversity and molecular evolution of the Araliaceae species.
Complete chloroplast genomes and 45S nrDNA sequences
Novel complete cp genomes of five species, P. notoginseng, P. japonicus, P. vietnamensis, A. elata, and E. sessiliflorus, and 45S nrDNA sequences of seven species, P. quinquefolius, P. notoginseng, P. japonicus, P. vietnamensis, A. elata, and E. sesiliflorus, and D. morbifera, were characterized in this study (Table 1).
Complete lengths of the cp genomes ranged between 155,993 bp (P. vietnamensis) and 156,730 bp (E. sessiliflorus), with average read-mapping coverages of 57× to 1,005× (Table 1). The 45S nrDNA sequences were assembled into single contig for each of the seven species. The sequence lengths of the complete 45S nrDNA unit was around 11 kb including full intergenic spacer (IGS) sequences (Table 1). The 45S nrDNA sequences of P. quinquefolius and E. sessiliflorus were full-length sequences covering a 45S transcription unit (18S-ITS1–5.8S-ITS2-26S) and complete IGS sequences, whereas those of the other five species (P. notoginseng, P. japonicus, P. vietnamensis, A. elata and D. morbifera) consisted of 45S nrDNA transcription regions and partial IGS sequences (Table 1).
Diversity of cp genome sequences
In addition to the five cp genomes described above, we used cp genomes of three species, P. ginseng, P. quinquefolius, and D. morbifera that were obtained in our previous study23,24,25. Cp genome sequences for two Araliaceae species, A. undulata, E. senticosus, were also retrieved from GenBank for comparative analysis (Table 1). The gene order and gene content were highly conserved among the ten cp genomes (Fig. 1a), and the complete cp genome sequences showed 97.1~99.6% similarity among ten species (Supplementary Table S1). With regard to the quadripartite structure of the cp genome, nucleotide polymorphisms were lower in the inverted repeat regions (IRs) than in large single copy (LSC) and small single copy (SSC) regions (Fig. 1b). The cp genomes of five Panax species shared more than 98.9% similarity, among which P. ginseng and P. quinquefolius showed the highest similarity over 99.6% at both whole cp genome and cp coding sequence levels, respectively (Supplementary Table S1).
The sequence polymorphism rates of conserved coding regions were from 0.4~1.8% among ten Araliaceae species, in which the rate between P. ginseng and P. quinquefolius was the lowest (0.4%). Five genes, namely petN, psaJ, psbF, psbN, and rpl23, showed no sequence polymorphism among the ten Araliaceae species (Fig. 2 and Supplementary Dataset S1, Supplementary Fig. S4). The ndhF genes in cp genomes of Aralia, Eleutherococcus and Dendropanax species were 97 bp, 36 bp and 18 bp farther from the border junction of IRa and SSC, compared to those in cp genomes of Panax species (Supplementary Fig. S1).
The average Ks values for all protein-coding genes except cp-rRNA genes were from 0.0014 to 0.0199 and 0.0014 to 0.0068 at the inter- and intra-genus levels, respectively. The lowest Ks value of 0.0014 was between the two closest Panax species, P. ginseng and P. quinquefolius. The average Ks value among five species within the Panax genus was lower (0.0056) than those between Aralia-Panax (0.0110) and Eleutherococcus-Dendropanax (0.0080) (Table 2). The highest Ks values were detected in infA, rpl22, rps19 and ndhE genes, while significantly high Ka/Ks ratios of more than 1 were detected in atpF, atpE, ycf2 and rps15 genes (P-value 8.3E-27–3.2E-2, Chi-square test) (Fig. 2).
Although repeat motifs were diverse among cp genomes, conserved common repeat motifs were also present at both the intra-genus and inter-genus levels. Tandem repeat (TR) units with sizes of 6 to 12 bp were abundant among ten cp genomes (Supplementary Fig. S2). High copy number variation (CNV) for TRs was detected in the ycf1 protein-coding gene and in the intergenic regions between trnC-GCA and petN, trnS-GGA and rps4, trnT-UGU and trnL-UAA, and cemA and petA (Fig. 3 and Supplementary Table S2, Supplementary Fig. S3).
Sequence variations of nrDNA sequences in Panax and relatives
In addition to seven 45S nrDNA sequences assembled in this study, 45S nrDNA sequence of P. ginseng and nrITS sequences of A. undulata and E. senticosus were retrieved from GenBank and used for comparative analysis (Table 1). When 45S nrDNA transcription sequences (18S-ITS1-5.8S-ITS2) were compared among eight species (excluding A. undulata and E. senticosus for which there was no available sequence), sequence polymorphisms were found in genic regions as well as in two ITS regions although the polymorphisms were more relatively frequent in two ITS regions (Fig. 4 and Supplementary Tables S3 and S4); 9, 4, 50, 42, and 44 SNPs in 18S (1,808 bp), 5.8S (160 bp), 26S (3,452 bp), ITS1 (223 bp) and ITS2 (233 bp) regions, respectively. The numbers of SNPs in the 45S nrDNA sequence were higher at the inter-genus level than at the intra-genus level (Supplementary Tables S3 and S4).
Phylogenetic analysis and divergence time estimation
A phylogenetic tree based on cp protein coding sequences showed two typical monophyletic lineages consisting of the Aralia-Panax group and the Eleutherococcus-Dendropanax group, which was also confirmed by a phylogenetic tree based on nrITS sequences (Fig. 5). Species of each genus, Panax, Aralia, Eleutherococcus, and Dendropanax, were grouped separately. The five Panax species were divided into two subgroups, in which one included P. ginseng and P. quinquefolius and the other included the remaining three Panax species (Fig. 5a). The topology of the tree based on nrITS sequences was almost identical to that in the cp sequence-derived tree, except for some differences among the Panax species (Fig. 5b).
Based on cp protein coding sequences, the divergence time between the Aralia-Panax group and the Eleutherococcus-Dendropanax group could be estimated at approximately 8.81–10.59 MYA (Fig. 5a). In the Eleutherococcus-Dendropanax clade, Dendropanax and Eleutherococcus were estimated to have diverged 4.48–5.60 MYA. In the Aralia-Panax clade, Aralia and Panax were estimated to have diverged 7.97–8.46 MYA, with subsequent speciation within these two genera likely to have occurred during a period approximately 2.89–3.20 and 2.58–3.20 MYA in Panax and Aralia, respectively. The two closest Panax species, P. ginseng and P. quinquefolius, were predicted to have diverged approximately 0.72–0.87 MYA.
Genetic diversity in cp genomes of the Araliaceae family
Sequence variation was low throughout cp protein coding sequences and intergenic sequences among the ten Araliaceae species examined herein, although the cp protein coding sequences were more conserved (0.4~1.8% polymorphism) compared to intergenic sequences (0.4~2.9% polymorphism). Based on Ks values and divergence time inferred in this study, the substitution rate among cp genomes of Panax and relatives was approximately 1.0 × 10−9 per year, which seems to be consistent with a previous report26.
Based on analysis of non-synonymous and synonymous SNPs, average Ks and Ka values in LSC, SSC and IR regions among Panax relatives were low and ranged from 0.0022 to 0.0236, and from 0.0011 to 0.0046, respectively. Overall, genes in IR regions showed lower Ks values than did those in LSC and SSC regions. The ratio of the two parameters, Ka/Ks, is defined as the degree of evolutionary change in plants27. Among the protein-coding genes in the ten Araliaceae cp genomes, atpF, atpE, ycf2, and rps15 were identified as being under positive selection (Fig. 2). Some of these genes were also reported to be related to evolution under positive selection pressure in Orobanchaceae and Sesamun indicum 28, 29. Among these five genes, atpF is a group II intron-containing gene with a known mechanism of splicing and heterogeneity in intron sequence30, 31.
Five cp genes conserved in the Araliaceae family
We identified five identical genes which showed no sequence variation among all 10 Araliaceae species. Among the five conserved genes, four (petN, psaJ, psbF, and psbN) were related to photosynthesis and one (rpl23) was related to the ribosomal large subunit (Fig. 2 and Supplementary Dataset S1). We compared the five genes with those of Daucus carota belonging to the sister family Apiaceae in the Apiales order and to those of Arabidopsis thaliana in the Brassicales order (Supplementary Fig. S4). The psbF gene in the Araliaceae species was identical to that of D. carota but differed from that of A. thaliana, suggesting that the gene may be highly conserved in the Apiales order. Four of the five conserved genes showed sequence variation with orthologs of D. carota, indicating that these four genes are specifically conserved only in the Araliaceae species (Supplementary Fig. S4). It will be interesting to further elucidate the role of these five genes in the evolution of Araliaceae family.
Nucleotide variation in ribosomal genes
Comparative analysis of eight 45S nrDNA sequences revealed that nucleotide polymorphism was much higher in nrITS regions than in transcribed ribosomal genes, 18S, 5.8S and 26S, as observed in other species32, 33. Most genetic diversity studies in plants have focused on ITS1 and ITS2 regions34,35,36,37,38. However, our study demonstrates that there is sufficient genetic diversity in the ribosomal gene regions for evolutionary analysis (Fig. 4 and Supplementary Tables S3 and S4). Among ribosomal genes, 26S was more divergent than 18S and 5.8S among the 10 Araliaceae species. Consistent with this finding, the 26S rDNA sequences have been applied for phylogenetic and divergence studies in yeasts39. The inter-genus level polymorphism was approximately three times higher than the intra-genus polymorphism (Supplementary Tables S3 and S4). Overall, our analysis suggests that 26S ribosomal gene sequences, in addition to the widely used ITS regions, can also be good targets for genetic diversity analyses for broad range and over genus-level taxon identification in the plant kingdom.
Phylogeny based on cpDNA and nrDNA of Araliaceae
Two monophyletic groups, Aralia-Panax and Eleutherococcus-Dendropanax, were simultaneously confirmed with both cpDNA-based and nrITS-based trees. The finding that Aralia and Panax were the most closely related genus is in accordance with previous reports15. The Panax genus was clearly classified as two groups, with diploid species spread in Southern Asia including P. notoginseng, P. vietnamensis, and P. japonicas and a tetraploid group widely distributed in Northern Asia and North America including P. ginseng and P. quinquifolius (Fig. 5a).
The cpDNA-derived topology reflects the uniparental inheritance of cpDNA and agrees with the previously proposed models for evolution of these five Panax species. However, our nrITS-derived tree did not clearly distinguish between tetraploid and diploid groups (Fig. 5b). The different topology for the Panax genus between the cpDNA-based and nrITS-based trees may be caused by differences inherent to cpDNA and nrDNA in hybrid species including polyploids. According to previous reports, nrDNA (or nrITS)-based trees occasionally lack clear resolution in hybrid species (included polyploids) or closely related groups40 owing to low substitution rates in nrDNA. The nrDNA homogenized by inter-locus concerted evolution can give rise to multiple-heterogeneous types, especially in hybrid speciation. Under homogenization conditions, analysis based solely on nrDNA (or nrITS) sequence can produce misleading phylogenetic results in closely related species and polyploids (e.g. allopolyploids). Therefore, we recently suggested utilization of both cpDNA-based and nrDNA-based trees as a method conducive to determining well-resolved taxonomical positions of inter-subspecies hybrids among Oryza AA genomes22.
Molecular clocks for speciation of Panax relatives
Based on previous reports, Araliaceae and Apiaceae diverged from a common ancestor in the Apiales order approximately 60.2 MYA41,42,43,44. In this study, we demonstrated that the Araliaceae family diverged into two monophyletic lineages approximately 8.81–10.59 MYA, followed by divergence of genera. Aralia and Panax, the two closest-related genera, diverged around 7.97–8.46 MYA, and Dendropanax and Eleutherococcus diverged after that (4.48–5.60 MYA).
In the Panax genus, we inferred speciation to have occurred more recently, around 2.89–3.20 MYA. This speciation step seems, like other plant speciation events, to be related to an Asian temperate climate change at that time period corresponding to uplift of the Himalaya-Tibetan plateau (in the late Cenozoic era)45,46,47. The Himalayas and southwestern China were suggested to be the base of speciation of Panax 8, and also reported to have a high rate of polyploidy because of the appearance of diverse species in the widespread alpine environment in these regions48. Earlier studies showed that P. ginseng underwent a recent whole-genome duplication 2.3 MYA49, followed by divergence of P. ginseng and P. quinquefolius 50. Our molecular clock estimation using complete cp coding genes supports the previous estimate and provides stronger evidence for that evolution scenario. P. ginseng might have evolved from allotetraploidization of diploid ancestors triggered by a large Asian temperate climate change. The diploid ancestors would have been isolated to the Northern hemisphere by uplift of the Himalaya-Tibetan plateau 2.89–3.20 MYA. We estimated divergence time between P. ginseng and P. quinquefolius to be approximately 0.72–0.87 MYA, i.e., following the recent genome duplication event at 2.3 MYA49. P. quinquefolius was suggested to have settled in North America by migration of the P. ginseng seeds via glacial movement of the Bering land bridge from eastern Asia to northern America and disjunction by geographical isolation 0.9–2.3 MYA51.
The complete sequences of cp genomes and 45S nrDNA obtained in this study will contribute toward further understanding of evolution in the Apiales order. Phylogenetic and evolutionary analysis of cp genomes indicated that the evolutionary divergence of Araliaceae family produced two monophyletic lineages, followed by diversification of genera and speciation. Our findings demonstrated that simultaneous utilization of cp genomes and nrDNAs support fine-scale resolution of the evolutionary and taxonomical relationships of Panax and relatives in the Araliaceae.
Materials and Methods
Plant samples of five Panax species (P. ginseng, P. quinquefolius, P. notoginseng, P. vietnamensis and P. japonicus) and four relative species (Aralia elata, Eleutherococcus sessiliflorus, E. senticosus and Dendropanax morbifera) were used in this study (Table 1). Leaves were harvested from P. ginseng, P. quinquefolius, and D. morbifera plants grown in a ginseng research field (Seoul National University, Suwon, Korea), and from P. vietnamensis wild plants collected from Dak To district, Kon Tum province, Vietnam. Roots of P. notoginseng and P. japonicus were collected from Dafang County, Guizhou province and Enshi County, Hubei province, China, respectively. Leaves of A. elata, E. sessiliflorus and E. senticosus were from the farm of the Susinogapy Corporation (Cheonan, Korea, www.susinogapy.com).
Genomic DNA isolation and whole-genome shotgun sequencing
Total genomic DNAs were isolated from tissue samples of the nine species using a modified cetyltrimethylammonium bromide (CTAB) method52 and examined using a spectrometer and agarose-gel electrophoresis. Illumina paired-end (PE) libraries were constructed with 300-bp insert size for each of eight species and sequenced using MiSeq or NextSeq platform by LabGenomics (Seongnam, Korea, http://www.labgenomics.co.kr/). Sequences of five species, P. notoginseng, P. vietnamensis, P. japonicus, A. elata, and E. sessiliflorus, were newly obtained in this study.
De novo assembly and annotation of cp genomes and 45S nrDNA sequences
Complete cp genomes and 45S nrDNA sequences were assembled by de novo assembly with the low-coverage whole-genome sequence (dnaLCW) method22. Assembly errors and gaps were manually corrected by mapping of raw PE reads22
Structural features and genes in cp genomes were predicted using the DOGMA program (http://dogma.ccbb.utexas.edu/) and manual curation based on BLAST searches. Circular maps of cp genomes were made using OGDRAW (http://ogdraw.mpimp-golm.mpg.de/). The structures of 45S nrDNA sequences were predicted by comparison with reported P. ginseng 45S nrDNA sequence (KM036295) and analyses using RNAmmer (http://www.cbs.dtu.dk/services/RNAmmer/), and BLAST searches.
Comparative analysis of cp genomes and 45S nrDNA sequences
Among the ten species, complete cp genomes of two species, A. undulata 12 (NC_022810) and E. senticosus 53 (NC_016430.1) were retrieved from GenBank. In addition, ITS1-5.8S-ITS2 (hereafter, nrITS) sequences of two species, A. undulata (AF273540) and E. senticosus (AB570259) were also retrieved from GenBank. To identify inter-species polymorphism, complete cp genome sequences of ten species were aligned and compared using MAFFT (http://mafft.cbrc.jp/alignment/server/) and mVISTA (http://genome.lbl.gov/vista/mvista/submit.shtml). For 45S nrDNA comparison, nrDNA sequences were extracted and compared among eight species, as for cp genomes. Tandem repeats (TRs) present on cp genomes of ten species were investigated using Tandem Repeat Finder (https://tandem.bu.edu/trf/trf.html).
Validation of polymorphic sites
Among polymorphic sites found in cp genomes, those with copy number variation (CNV) of TRs and InDels were selected for validation. PCR primer pairs were designed based on the flanking sequences of the selected polymorphic sites using Primer 3 program (http://bioinfo.ut.ee/primer3-0.4.0/) and used for genomic DNA PCR analyses. Amplified fragments were analyzed by separation in agarose gels and ethidium bromide staining.
Phylogenetic analysis and estimation of divergence time
Divergence time was first calculated based on Ks value. Ka and Ks values represent the number of non-synonymous and synonymous substitution per site, respectively. To estimate divergence time of these ten species in the Araliaceae family, 79 protein-coding sequences from each cp genome of the ten species were extracted and concatenated. Mean Ka and Ks values were calculated by pair-wise comparison of nucleotide substitution among the common cp protein-coding genes of the ten species, using the PAML program54. Divergence time (T) was given by T = Ks/2λ, where λ is approximately 1.0 × 10−9 substitutions per site per year26.
Second, divergence time was calculated using the Bayesian Inference (BI) method. A phylogenetic tree was generated based on BI analysis using BEAST version 1.8.155 with 95% highest posterior density. The analysis was conducted with the data of 79 chloroplast protein-coding gene sequences using a strict clock approach, with Yule prior on the tree, general time reversible (GTR + I + Γ) as a substitution model and the default priors for generating a random starting tree. The maximum probability clade trees were calculated using TreeAnnotator (version 1.8.1) and root age was constrained to be 60.2 million years ago (MYA) based on previously reported divergence time between Araliaceae and Apiaceae in the order Apiales41,42,43,44. The final tree was visualized in FigTree v1.3.1.
The nrITS-based tree was constructed using the ML method of MEGA6.0 with 1,000 bootstrap replicates. As outgroup sequences for phylogenetic analysis, cp genome56 (NC_008325.1) and nrITS sequence (AY552527.1) of Daucus carota (carrot) belonging to the Apiaceae family were used.
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This work is supported by Cooperative Research Program for Agriculture Science & Technology Development (Project No. PJ01100801), Rural Development Administration, the Bio & Medical Technology Development Program (NRF-2015M3A9A5030733) of the National Research Foundation (NRF) which was funded by the Ministry of Science, ICT, and Future Planning, and a grant (16172MFDS229) from Ministry of Food and Drug Safety in 2016, Republic of Korea.