Introduction

The genus Ardisia Sw. belongs to the family Primulaceae. It consists of more than 700 species, which are typically found in tropical America, Pacific Islands, eastern Indian peninsula, and east to the south of Asia1. Ardisia plants are usually used as traditional medicine in China due to their multiple medicinal properties2, such as anti-neoplastic, anti-hypertension, anti-inflammatory, anti-arthritis, anti-angiogenesis, and analgesic. Therefore, it has become the focus of numerous researchers to explore the effective chemical components with various pharmacological effects in Ardisia species.3,4,5,6,7,8,9.

Researchers have investigated the genetic relationships among 24 Ardisia species by morphological characteristics and the matK genetic marker10. However, their inferences on the evolutionary relationship of Ardisia species were affected by unstable morphological characteristics or insufficient genetic information, resulting in significant differences in research results. The unclear phylogenetic relationships among Ardisia species has seriously hindered the further development of these important resources.

Chloroplasts are important photosynthetic organelles in green plants and have a set of genetic materials independent of the nucleus11,12. The chloroplast genome of angiosperms is generally considered to be a closed circular DNA molecule with a size between 120 and 160 kb and it has a typical conserved quadripartite structure, which consists of a large single copy (LSC) region, a small single copy (SSC) region and two copies of inverted repeats (IR)13,14. Furthermore, chloroplast DNA is maternally inherited, making them suitable for the analysis of phylogenetic relationships among species, especially for the closely related species15,16. With the application of high-throughput sequencing technology, it is now possible to reveal the phylogenetic relationships and develop molecular makers among species by using the genetic information of the whole chloroplast genome in many plant groups17,18. Therefore, we attempted to study the chloroplast genomes of Ardisia crenata Sims, A. gigantifolia Stapf, A. mamillata Hance and A. villosa Roxb. In order to elucidate the molecular evolution and genetic relationships of these species.

In this study, chloroplast genomes of four Ardisia species were sequenced. Firstly, we elucidated the size and structure of chloroplast genome. Secondly, the genomic repeats, and the variation of simple sequence repeats (SSRs) were identified. Finally, the phylogenetic relationships of the four Ardisia species and other Primulaceae species with available complete chloroplast genomes were analyzed. This work is of great significance for future research on the adaptive evolution of Ardisia species.

Results

Chloroplast genomes features of Ardisia species

The complete chloroplast genomes in the four Ardisia species ranged in size from 156,550 bp (A. crenata) to 156,734 bp (A. mamillata) (Table 1). All plastomes displayed typical quadripartite circle molecules consisting of a pair of IR regions (25,937–26,147 bp) separated by an LSC region (86,009–86,333 bp) and an SSC region (18,347–18,495 bp). All four Ardisia species shared uniform overall GC contents, ranging from 37.1 to 37.3%. What’s more, the GC contents in the LSC, IR, SSC regions were 34.9–35.1%, 42.9–43.1% and 30.1–30.4%, respectively.

Table 1 Summary of the general features of four Ardisia chloroplast genomes.

After annotation, we identified a total of 114 unique genes, including 80 protein-coding genes (CDS), 30 tRNAs and 4 four rRNAs (Table 2, Table S2). Out of these, 19 genes were duplicated in the IR regions, including seven tRNAs, four rRNAs (rrn16, rrn23, rrn4.5 and rrn5) and eight protein-coding genes (Fig. 1). The GC content of the first, second and third codon sites in the CDS regions of these four Ardisia chloroplast genomes were 40.3%/34.6%/38.1% (A. gigamtifolia), 32.3%/41.6%/38.6% (A. crenata), 42.7%/38.5%/31.3% (A. mamillata), 38.1%/31.3%/43.3%(A. villosa), respectively (Table S2). Furthermore, there were seventeen genes that harbored introns, among which fifteen distinct genes contain only a single intron, whereas two genes (ycf3, clpP) harbored two introns. The trnK-UUU gene owned the longest intron (2534–2550 bp) and its intron region contained the matK gene (Table S3). Across these chloroplast genomes, the rps12 was found to trans-splice in the IR-LSC region in which the 3’-end is duplicated in the IR regions and the 5’-end located in the LSC region.

Table 2 Statistics on the number of genes and nucleotide positions in the chloroplast genomes of four Ardisia species.
Figure 1
figure 1

Plastome features of Ardisias. Genes exhibited inside the circle are transcribed in a clockwise direction and those outside are transcribed counterclockwise. The gray arrow represents gene direction. Different genes are color-coded. The dashed gray area in the inner circle shows the percent GC content of genes, whereas the lighter gray corresponds to AT content.

Codon usage

In these Ardisia chloroplast genomes, the protein-coding genes presented a total of 52,183 to 52,244 codons, with the A. mamillata containing the most abundant codons and the A. crenata containing the least (Table S4). The most frequently used codon in the four plastomes was the UUU that encoded phenylalanine (Phe), while the least-used codon was the GCG encoded alanine (Ala). Among the four species of Ardisia, the number of codons with RSCU > 1 was equally and the RSCU values of the same codons in our four plastomes were slightly different (Fig. 2, Table S4). The results of RSCU in the four Ardisia species showed A or T was biased toward a higher nucleotide frequency than C or G at the third codon position and the result was similar to other angiosperms chloroplast genomes.

Figure 2
figure 2

Codon content for protein-coding sequences in four Ardisia chloroplast genomes. The x-coordinate represents the 20 amino acids and the terminators, while the y-coordinate represents the RSCU value of the corresponding amino acid of each species. The species of each amino acid from left to right are A. gigantifolia, A. crenata, A. mamillata and A. villosa, respectively. The different colors of each amino acid corresponding to the codon of the same color below. Asp asparticacid, Leu leucine, Ile isoleucine, Met methionine, Asn asparagine, Val valine, Phe phenylalanine, Ser serine, Pro proline, Gln glutarnine, Lys lysine, Glu glutamic acid, Thr threonine, Ala alanine, Tyr tyrosine, His histidine, Cys cysteine, Trp tryptophan, Arg arginine, Gly glycine.

RNA editing analysis

We used the PREP-Cp database19 to predict possible RNA editing sites in the chloroplast genomes of four Ardisia species. The results revealed that (Table S5) among the 80 protein-coding genes in the chloroplast genomes of the four species, RNA editing occurred in 14 genes in the A. gigantifolia and 15 genes occurred in the other three species, containing 47–50 editing sites. All editing events involved C to U conversion and also caused changes in amino acids. A statistical analysis of the codon locations showed that 17 mutations occurred in the first position of the codon, while the remaining were found in the second position. There were eleven types of amino acid transformation, including T → I, P → S, P → L, S → L, A → V, H → Y, L → F, S → F, T → M, R → W and R → C, among which the S → L transformation was the most common. Among all genes undergoing editing, rpoC2 and rpoB owned the most editing sites (7–10), followed by ndhB (5).

Repeat and simple sequence repeat (SSR) analysis

Using PREter and Tandem online tools, the repetitive sequences of four Ardisia plastomes were analyzed, and the quantitative comparison maps of reverse (R), forward (F), palindromic (P), complement (C) and tandem(T) repeat sequences were summarized (Fig. 3).

Figure 3
figure 3

Repeat analysis of four Ardisias. The ordinates represent the number of (a) different repeat types, (b) tandem repeats, (c) forward repeats and (d) palindromic repeats in four Ardisia chloroplast genomes, respectively. Repeats with different lengths are indicated in different colors, the ordinate represents the number of repeats.

As shown in Fig. 3a, there was no significant difference in the number of repeated sequences among the four plants, with the number of 50–55 long repeats and 34–41 tandem repeats. It should be noted that the number of F and P duplications in each species was about 1:1, while no C duplications were observed in any of the three species except for A. mamillata.

Simple sequences repeats (SSRs), the DNA sequences consisting of multiple repeats of 1–6 nucleotide(s), are widely distributed in eukaryotes. Using MISA analysis, six types (i.e., mono-, di-, tri-, tetra-, penta-, and hexa-nucleotides) SSRs were detected in four plastomes of Ardidia species, and each chloroplast genome was found to contain 270 (A. gigantifolia) to 279 (A. crenata) SSRs. A.crenta and A. mamillata contained five types of SSRs, excluding the hexanucleotide, while the A. gigantifolla and  A.villosa contained six and four types, respectively. (Fig. 4, Table S6). For the different unit size, mononucleotide SSRs were most highly abundant (55.5%, 55.9%, 55.5%, 56.2% in A. gigantifolla, A. crenata, A. mamillata and A. villosa, respectively).

Figure 4
figure 4

SSR loci analysis of four Ardisias.

In conclusion, the comparative analysis of the repeats and SSRs in the four Ardisia chloroplast genomes showed that (Figs. 3, 4). A. crenata, A. mamillata and A.villosa were resembling with each other, while the repeats and SSR types of the A. gigantifolia were more than those of the other chloroplast genomes. Probably because the A. gigantifolia belongs to the serrate group and the other three plants belong to the crenate group of the Ardisia genus, this indicated that there were differences between different groups of species.

Comparative analysis of the Ardisia plastomes

As mentioned above, the typical quadripartite structure of the chloroplast genome consists of two different single-copy regions and two inverted repeat regions. In order to get a better understanding of the IR region evolution, a comparative study was performed between the four Ardisia chloroplast genomes and several related species, including Ardisia solanacea (Poir.) Roxb., Embelia vestita Roxb., and Myrsine stolonifera (Koidz.) E. Walker, investigating the length of the IR region and the variation between the IR regions and SC (LSC and SSC) boundaries. In general, the boundaries between the IRs and LSC/SSC regions of seven Primulaceae species exhibit a similar pattern, in which the SSC/IRa node was located in the ycf1 gene and the LSC/IRb position was located in the rps19 gene. Consequently, due to this cross-boundaries phenomenon, corresponding incomplete copies of ycf1 and rps19 appear at the boundaries of IRb/SSC and IRa/LSC junction, respectively (ψycf1, ψrps19). Meanwhile, the ndhF gene was located at the junction of IRb and SSC, and its 3' end was overlapped with ψycf1 in the other six plants with the exception of A. mamillata. The trnH gene was present at the junction of IRa/LSC (Fig. 5).

Figure 5
figure 5

Comparison of the IRs, LSC, SSC boundary features among the seven Primulaceae chloroplast genomes. The number represents the distance between the ends of the genes and the boundaries. Ψ: pseudogenes.

The mVISTA platform was conducted for comparing the overall identity among the four Ardisia chloroplast genomes and three other reported Primulaceae species. As illustrated in Fig. 6, all seven species had similar chloroplast gene sequence and structure, and the non-coding regions showed more variation than the coding regions as colored in purple bars. It is noteworthy that the two IR regions were more conservative than the remaining two regions. Further study found that the gene spacer regions were significantly different among the chloroplast genomes of the seven Primulaceaes, for example, trnT-trnL, trnT-psbD, rpl32-ccsA, ycf1, ndhF-rpl32 and ccsA-ndhD.

Figure 6
figure 6

Overall sequence identity of plastomes belonged to seven Primulaceae species. Gray arrows indicate the transcriptional direction of genes, purple bars represent coding regions, and pink bars represent non-coding regions. The vertical scales represent percentage similarity ranging from 50 to 100%.

To explore the sequence differences among Ardisia chloroplast genomes, the number of nucleotide substitutions were counted and genetic distances were calculated based on the Kimura-2-parameter through the MEGA tool (Table 3). The results showed that the number of nucleotide substitutions of the four species was 171–1237 and the genetic distance was 0.001050–0.007980. In general, the number of nucleic variations in the A. gigantifolia chloroplast genome sequence was higher than that of A. crenata, A. mamillata, and A. villosa. Similarly, the genetic distance was greater. It confirmed the species diversity among different groups of the Ardisia genus.

Table 3 Genetic distance analysis of four Ardisia chloroplast genomes.

We further calculated the nucleotide polymorphisms of 800 bp window. Among the four Ardisia chloroplast genomes, the Pi values varied from 0 to 0.045 and detected five higher-variable regions (Pi > 0.012), namely trnT-psbD, ndhF-rpl32, rpl32-ccsA, ccsA-ndhD and ycf1 (Fig. 7).

Figure 7
figure 7

Comparative analysis of the nucleotide variability by Pi values of the four Ardisia CP genomes presented in a sliding window (window length: 800 bp; step size: 200 bp). The X-axis represents the position of the point in the window, while the Y-axis represents the nucleotide polymorphism in each window.

Phylogenetic analysis

We downloaded 46 published chloroplast genome sequences of Primulaceaes from the NCBI database and established an ML tree with four Ardisia chloroplast genomes in this study to clarify better the evolutionary relationships within the Primulaceae (Fig. 8). The Rhododendron simsii Planch. is an extraneous group of Ericaceae, and the NCBI accession numbers of 47 species are listed in Table S1. It could be seen from Fig. 8 that among the genera of Primulaceae, Tapeinosperma was observed to be a sister lineage of Ardisia with strong bootstrap values of 100%. In addition, The A. mamillata, A. villosa, A. polysticta, and A. crenata which belong to the crispardisia group from the Ardisia were well clustered and could therefore be distinguished from A. gigantifolia of the bladhia group.

Figure 8
figure 8

Phylogenetic tree reconstruction of 51 plants using maximum likelihood methods based complete chloroplast genomes. The bootstrap values were labeled on the evolutionary branches and four species in this study were labeled with solid black dots.

Discussion

The results indicated that the chloroplast genomes of four Ardisia species were similar in structure, genome length, and organization. The length of the four chloroplast genomes ranged from 156,550 bp (A. crenata) to 156,734 bp (A. mamillata), it was within the size range of chloroplast genomes in other angiosperms20,21. The Ardisia chloroplast genomes displayed the typical quadripartite structure with similar GC content, indicating that the almost identical levels among the Ardisia chloroplast genomes. The IR regions had the highest GC content, which might be caused by the decrease of AT nucleotides in the four rRNA genes (rrn16, rrn23, rrn4.5, rrn5)22. Compared to the previously published data, the structural features of the Ardisia chloroplast genomes were highly similar to those of other Primulaceae chloroplast genomes23.

Introns, a group of self-catalytic ribozymes that could splice their own excision from mRNA, tRNA, and rRNA precursors, help to infer phylogenetic relationships. The length of exons and introns in genes was important information in plant chloroplast genome24,25. In this study, there were two genes (ycf3 and clpP) including two introns in four Ardisia chloroplast genome. The ycf3 has been reported to be a gene closely related to photosynthesis26. Therefore, the acquisition of ycf3 gene will make an important contribution to the further study of the Ardisia chloroplast.

The synonymous codons usually only mutate in the third position to adapt to the existence of gene mutations and natural selections27. The relative synonymous codon usages (RSCU) refers to the frequency of specific codons in synonymous codons for a certain amino acid28,29. The above results showed that the codon preference of the four Ardisia species is high consistency, which is congruent with other genera.

After being transcribed, chloroplast mRNA molecule usually undergoes RNA editing, a process of C-to-U conversion is performed at specific sites to regulate gene expression and translation in the chloroplast. RNA editing plays an important regulatory role in plant growth, development, stress response, and other physiological and biochemical processes30. Identification of RNA editing sites will benefit the study of related biological functions. In our work, potential RNA editing sites were identified in 14–15 protein-coding genes of four Ardisia species. All editing events involved C to U conversion and also caused changes in amino acids31,32, while the S → L transformation was the most common. It was similar to the composition characteristics of RNA editing in chloroplast genomes of other plants.

Repeat sequences detected in plastomes have been proven to be correlated with rearrangement, sequence divergence, and recombination33. They provide vital information for understanding the evolutionary history and sequence divergence of plant species34,35. With the advantages of high polymorphism, stability and repeatability, SSRs have been widely used in genetic diversity analysis, species identification, and molecular breeding36,37,38. We detected six types of SSRs in four plastomes of Ardidia species. According to the result, these high variabilities of SSRs may provide strong value and evidence for molecular breeding and identification of medicinal plants.

IR regions are the most conserved regions in the chloroplast genomes39. Frequent expansions and contractions at the junctions of SSR and LSC with IRs have been recognized as evolutionary signals for which illustrating the relationships among taxa40. It is believed that the contractions and expansions of IR regions in angiosperms are generally accompanied by the changes in the length and distribution of ycf1 and rps1941. We found that the boundaries between the IRs and LSC/SSC regions of seven Primulaceae species exhibit a similar pattern, in which the ycf1 gene and the rps19 gene were located in the SSC/IRa node and the LSC/IRb node respectively.

Studies have shown that mutations in chloroplast genomes can be concentrated and become hotspots for identification and defined as DNA barcoding42,43. It is a remarkable fact that single copy regions’ mutation rate is significantly higher than reverse repeat regions. Among the four Ardisia chloroplasts, five higher-variable regions (trnT-psbD, ndhF-rpl32, rpl32-ccsA, ccsA-ndhD and ycf1) were detected, which can be selected as the DNA barcoding of Ardisia.

In recent years, the chloroplast genome has become an indispensable tool to investigate phylogenetic development of species, and it could be widely used in phylogenetic reconstruction of plants at different taxonomic levels, such as order, family, genus and species44,45. The genetic relationships of Ardisia in Primulaceae are still somewhat uncertain. Phylogenetic relationships of Primulaceae species were inferred based on the available plastomes using ML methods. As indicated by Fig. 8, all branches of the phylogenetic tree are strongly supported. The phylogenetic analyses revealed that four chloroplast genomes presented a close relationship with other reported Ardisias, and different groups of Ardisia could be distinguished from each other (crispardisia group and bladhia group), which clearly indicated that these phylogenetic results are consistent with morphological taxa. Hence, in order to better elucidate the phylogenetic relationships of Primulaceae, more chloroplast genomes are needed to be sequenced.

Conclusions

The complete chloroplast genomes of A. crenata, A. gigantifolia, A. villosa, and A. mamillata were sequenced and characterized. These four plastomes were presented as circular molecular with typical quadripartite structure and shared a similar gene composition and the base content. In the Ardisia chloroplast genomes, mononucleotide SSR and tandem repeats were dominant. Through the analysis of the variable sites, five potential mutation hotspots were found, which laid the foundation for the molecular identification of this genus. Furthermore, four chloroplast genomes presented a close relationship with other reported Ardisia species and confirmed the sister linage with Tapeinosperma. Interestingly, different groups of Ardisia could be distinguished from each other. In a nutshell, these highly variable sites and the complete chloroplast genomes provided sufficient information for contributing to the further study of the molecular evolution and genetic relationship among Primulaceae.

Methods

Plant materials and DNA extraction

Four Ardisia species used in this experiment were all collected from the medicinal botanical garden of Guangzhou University of Chinese Medicine (GUCM). These species have not been included in the list of national key protected plants, and permission was not necessary for collecting them. Experimental research on plants, including the collection of plant materials, complies with relevant institutional, national, and international guidelines and legislation. The authenticity of the plant materials has been verified by Professor Ji Lin and Dr. Guifang Zhang of GUCM. Voucher specimens of A. crenata (voucher number 441823LY1019), A. gigantifolia (441823LY0880), A. mamillata (441823LY0451), and A. villosa (441823LY0951) were deposited at the Chinese medicine herbarium of GUCM.

Fresh young leaves of A. crenata, A. gigantifolia, A. mamillata, and A. villosa were obtained, and be conducted to DNA extraction with the Plant Genomic DNA Kit (Tiangen, China). Then, the DNA concentration and quality of each sample were checked by Ultraviolet Spectroscopy with a Nanodrop-2000 spectrometer (Nanodrop Technologies, Wilmington, DE, USA) and agarose gel electrophoresis.

Genome sequencing, assembly and annotation

Approximately 3–5 µg of total DNA was sheared into short-insert (350 bp) fragments, followed by library construction. Then, these libraries were evaluated and conducted to genome sequencing with an Illumina HiSeq 4000 platform, generating approximately 5 GB of raw data for each sample.

Clean reads were retained after filtering the low-quality reads and removing adapters using the Trimmomatic (v0.39, Max Planck Institute of Molecular Plant Physiology, Potsdam, Germany). Cp-like reads were extracted from clean Reads according to sequence similarity using the bwa software (v0.7.17)46, then these cp-like reads were assembled using the SPAdes (v3.13.1) to generate contigs. These contigs were plotted against reference plastome to adjust their orientation and gaps were filled with the GapCloser47. All clean data was then mapped to the final assembly sequence and visualized by the IGV tool to get an overview of the reads coverage, resulting in complete chloroplast genomes.

Preliminarily gene annotation of four complete chloroplast genomes was performed by the GeSeq online tool (https://chlorobox.mpimp-golm.mpg.de/geseq.html) with default parameters48. and further revised manually based on the referential chloroplast genome of A.solanacea (NC_045098.1). Finally, a gene map of the annotated Ardisia chloroplast genome was drew using the OGDRAW tool (http://ogdraw.mpimp-golm.mpg.de/)49.

Codon usage and prediction of RNA editing sites

To examine the deviation in synonymous codon usage, the amino acid frequency and the relative synonymous codon usage (RSCU) were analyzed using the Molecular Evolutionary Genetic Analysis (MEGA, version 7)50. To predict the possible RNA editing sites in the four Ardisia chloroplast genomes, the online tool of Predictive RNA Editor for Plants (PREP, http://prep.unl.edu/)19 was adapted with a cutoff value set as 0.8.

Analysis of repeat elements in four Ardisias

REPuter (https://bibiserv2.cebitec.uni-Bielefeld.de/reputer)51 was used to identify long repeat sequences with a Hamming distance set as three and a minimum repeat size set as 30 bp. Additional, Tandem repeats finder (https://tandem.bu.edu/trf/trf.html)52 was used to detect tandem repeats and the SSRs in the Ardisia chloroplast genomes were identified using MISA software53, with alignment parameters set to 2, 7 and 7 for matches, mismatches and indels.

Comparative analysis

The mVISTA tool (http://genome.lbl.gov/vista/index.shtml) was used to investigate the divergences between the Ardisia complete chloroplast genomes and three referential Primulaceae species in the Shuffle-LAGAN mode. Moreover, IR expansions/contractions were summarized manually54.

Those selected chloroplast genomes were aligned using the MAFFT (v7.419) software with a default setting and then adjusted manually by Se-Al 2.02455. Next, MEGA7 was used to calculate the single nucleotide variants (SNV) and the mean genetic distance between the chloroplast genome sequence of Ardisias. Additionally, DnaSP v5.10 was used to calculate the Pi value and the SNP variation sites of the four Ardisia chloroplast genomes. The step size was set to 200 bp with an 800 bp window length56.

Phylogenetic analysis

A phylogenetic analysis was conducted using the complete chloroplast genomes of four Ardisia species and forty-six Primulaceae species, with an outgroup was Rhododendron simsii which belongs to Ericaceae. These complete chloroplast genomes were downloaded from the NCBI database (Table S1) and multi-sequence alignment was performed by the MAFFT program. Finally, the phylogenetic analyses with Maximum likelihood (ML) was conducted using the GTR + G substitution model, which was selected based on model screening. Bootstrap analysis was executed with 1,000 replicates and TBR branch swapping.