Comparative genomic study on the complete plastomes of four officinal Ardisia species in China

Ardisia Sw. (Primulaceae) is naturally distributed in tropical and subtropical areas. Most of them possess edible and medicinal values and are popular in clinical and daily use in China. However, ambiguous species delineation and genetic information limit the development and utilization of this genus. In this study, the chloroplast genomes of four Ardisia species, namely A. gigantifolia Stapf, A. crenata Sims, A. villosa Roxb. and A. mamillata Hance, were sequenced, annotated, and analyzed comparatively. All the four chloroplast genomes possess a typical quadripartite structure, and each of the genomes is about 156 Kb in size. The structure and gene content of the Ardisia plastomes were conservative and showed low sequence divergence. Furthermore, we identified five mutation hotspots as candidate DNA barcodes for Ardisia, namely, trnT-psbD, ndhF-rpl32, rpl32-ccsA, ccsA-ndhD and ycf1. Phylogenetic analysis based on the whole-chloroplast genomes data showed that Ardisia was sister to Tapeinosperma Hook. f. In addition, the results revealed a great topological profile of Ardisia’s with strong support values, which matches their geographical distribution patterns. Summarily, our results provide useful information for investigations on taxonomic differences, molecular identification, and phylogenetic relationships of Ardisia plants.

The genus Ardisia Sw. belongs to the family Primulaceae. It consists of more than 700 species, which are typically found in tropical America, Pacific Islands, eastern Indian peninsula, and east to the south of Asia 1 . Ardisia plants are usually used as traditional medicine in China due to their multiple medicinal properties 2 , such as anti-neoplastic, anti-hypertension, anti-inflammatory, anti-arthritis, anti-angiogenesis, and analgesic. Therefore, it has become the focus of numerous researchers to explore the effective chemical components with various pharmacological effects in Ardisia species. [3][4][5][6][7][8][9] .
Researchers have investigated the genetic relationships among 24 Ardisia species by morphological characteristics and the matK genetic marker 10 . However, their inferences on the evolutionary relationship of Ardisia species were affected by unstable morphological characteristics or insufficient genetic information, resulting in significant differences in research results. The unclear phylogenetic relationships among Ardisia species has seriously hindered the further development of these important resources.
Chloroplasts are important photosynthetic organelles in green plants and have a set of genetic materials independent of the nucleus 11,12 . The chloroplast genome of angiosperms is generally considered to be a closed circular DNA molecule with a size between 120 and 160 kb and it has a typical conserved quadripartite structure, which consists of a large single copy (LSC) region, a small single copy (SSC) region and two copies of inverted repeats (IR) 13,14 . Furthermore, chloroplast DNA is maternally inherited, making them suitable for the analysis of phylogenetic relationships among species, especially for the closely related species 15,16 . With the application of high-throughput sequencing technology, it is now possible to reveal the phylogenetic relationships and develop molecular makers among species by using the genetic information of the whole chloroplast genome in many plant groups 17,18 . Therefore, we attempted to study the chloroplast genomes of Ardisia crenata Sims, A. gigantifolia Stapf, A. mamillata Hance and A. villosa Roxb. In order to elucidate the molecular evolution and genetic relationships of these species.
In this study, chloroplast genomes of four Ardisia species were sequenced. Firstly, we elucidated the size and structure of chloroplast genome. Secondly, the genomic repeats, and the variation of simple sequence repeats (SSRs) were identified. Finally, the phylogenetic relationships of the four Ardisia species and other Primulaceae Codon usage. In these Ardisia chloroplast genomes, the protein-coding genes presented a total of 52,183 to 52,244 codons, with the A. mamillata containing the most abundant codons and the A. crenata containing the least (Table S4). The most frequently used codon in the four plastomes was the UUU that encoded phenylalanine (Phe), while the least-used codon was the GCG encoded alanine (Ala). Among the four species of Ardisia, the number of codons with RSCU > 1 was equally and the RSCU values of the same codons in our four plastomes were slightly different (Fig. 2, Table S4). The results of RSCU in the four Ardisia species showed A or T was biased toward a higher nucleotide frequency than C or G at the third codon position and the result was similar to other angiosperms chloroplast genomes.

RNA editing analysis.
We used the PREP-Cp database 19 to predict possible RNA editing sites in the chloroplast genomes of four Ardisia species. The results revealed that (Table S5) among the 80 protein-coding genes in the chloroplast genomes of the four species, RNA editing occurred in 14 genes in the A. gigantifolia and 15 genes occurred in the other three species, containing 47-50 editing sites. All editing events involved C to U conversion and also caused changes in amino acids. A statistical analysis of the codon locations showed that 17 mutations occurred in the first position of the codon, while the remaining were found in the second position. There were eleven types of amino acid transformation, including T → I, www.nature.com/scientificreports/ L → F, S → F, T → M, R → W and R → C, among which the S → L transformation was the most common. Among all genes undergoing editing, rpoC2 and rpoB owned the most editing sites (7-10), followed by ndhB (5).
Repeat and simple sequence repeat (SSR) analysis. Using PREter and Tandem online tools, the repetitive sequences of four Ardisia plastomes were analyzed, and the quantitative comparison maps of reverse (R), forward (F), palindromic (P), complement (C) and tandem(T) repeat sequences were summarized (Fig. 3). As shown in Fig. 3a, there was no significant difference in the number of repeated sequences among the four plants, with the number of 50-55 long repeats and 34-41 tandem repeats. It should be noted that the number of F and P duplications in each species was about 1:1, while no C duplications were observed in any of the three species except for A. mamillata.
Simple sequences repeats (SSRs), the DNA sequences consisting of multiple repeats of 1-6 nucleotide(s), are widely distributed in eukaryotes. Using MISA analysis, six types (i.e., mono-, di-, tri-, tetra-, penta-, and hexanucleotides) SSRs were detected in four plastomes of Ardidia species, and each chloroplast genome was found   Comparative analysis of the Ardisia plastomes. As mentioned above, the typical quadripartite structure of the chloroplast genome consists of two different single-copy regions and two inverted repeat regions. In order to get a better understanding of the IR region evolution, a comparative study was performed between the four Ardisia chloroplast genomes and several related species, including Ardisia solanacea (Poir.) Roxb., Embelia vestita Roxb., and Myrsine stolonifera (Koidz.) E. Walker, investigating the length of the IR region and the variation between the IR regions and SC (LSC and SSC) boundaries. In general, the boundaries between the IRs and LSC/SSC regions of seven Primulaceae species exhibit a similar pattern, in which the SSC/IRa node was located in the ycf1 gene and the LSC/IRb position was located in the rps19 gene. Consequently, due to this crossboundaries phenomenon, corresponding incomplete copies of ycf1 and rps19 appear at the boundaries of IRb/ SSC and IRa/LSC junction, respectively (ψycf1, ψrps19). Meanwhile, the ndhF gene was located at the junction of IRb and SSC, and its 3' end was overlapped with ψycf1 in the other six plants with the exception of A. mamillata. The trnH gene was present at the junction of IRa/LSC (Fig. 5).
The mVISTA platform was conducted for comparing the overall identity among the four Ardisia chloroplast genomes and three other reported Primulaceae species. As illustrated in Fig. 6, all seven species had similar chloroplast gene sequence and structure, and the non-coding regions showed more variation than the coding regions as colored in purple bars. It is noteworthy that the two IR regions were more conservative than the remaining two regions. Further study found that the gene spacer regions were significantly different among the chloroplast genomes of the seven Primulaceaes, for example, trnT-trnL, trnT-psbD, rpl32-ccsA, ycf1, ndhF-rpl32 and ccsA-ndhD.
To explore the sequence differences among Ardisia chloroplast genomes, the number of nucleotide substitutions were counted and genetic distances were calculated based on the Kimura-2-parameter through the MEGA tool ( Table 3). The results showed that the number of nucleotide substitutions of the four species was 171-1237 and the genetic distance was 0.001050-0.007980. In general, the number of nucleic variations in the A. gigantifolia chloroplast genome sequence was higher than that of A. crenata, A. mamillata, and A. villosa. Similarly, the genetic distance was greater. It confirmed the species diversity among different groups of the Ardisia genus.
Phylogenetic analysis. We downloaded 46 published chloroplast genome sequences of Primulaceaes from the NCBI database and established an ML tree with four Ardisia chloroplast genomes in this study to clarify better the evolutionary relationships within the Primulaceae (Fig. 8). The Rhododendron simsii Planch. is an extraneous group of Ericaceae, and the NCBI accession numbers of 47 species are listed in Table S1. It could be seen from Fig. 8 that among the genera of Primulaceae, Tapeinosperma was observed to be a sister lineage of Ardisia with strong bootstrap values of 100%. In addition, The A. mamillata, A. villosa, A. polysticta, and A. crenata which belong to the crispardisia group from the Ardisia were well clustered and could therefore be distinguished from A. gigantifolia of the bladhia group. www.nature.com/scientificreports/

Discussion
The results indicated that the chloroplast genomes of four Ardisia species were similar in structure, genome length, and organization. The length of the four chloroplast genomes ranged from 156,550 bp (A. crenata) to 156,734 bp (A. mamillata), it was within the size range of chloroplast genomes in other angiosperms 20,21 . The Ardisia chloroplast genomes displayed the typical quadripartite structure with similar GC content, indicating that the almost identical levels among the Ardisia chloroplast genomes. The IR regions had the highest GC content, which might be caused by the decrease of AT nucleotides in the four rRNA genes (rrn16, rrn23, rrn4.5, rrn5) 22 . Compared to the previously published data, the structural features of the Ardisia chloroplast genomes were highly similar to those of other Primulaceae chloroplast genomes 23 . Introns, a group of self-catalytic ribozymes that could splice their own excision from mRNA, tRNA, and rRNA precursors, help to infer phylogenetic relationships. The length of exons and introns in genes was important information in plant chloroplast genome 24,25 . In this study, there were two genes (ycf3 and clpP) including two introns in four Ardisia chloroplast genome. The ycf3 has been reported to be a gene closely related to photosynthesis 26 . Therefore, the acquisition of ycf3 gene will make an important contribution to the further study of the Ardisia chloroplast.
The synonymous codons usually only mutate in the third position to adapt to the existence of gene mutations and natural selections 27 . The relative synonymous codon usages (RSCU) refers to the frequency of specific codons in synonymous codons for a certain amino acid 28,29 . The above results showed that the codon preference of the four Ardisia species is high consistency, which is congruent with other genera.
After being transcribed, chloroplast mRNA molecule usually undergoes RNA editing, a process of C-to-U conversion is performed at specific sites to regulate gene expression and translation in the chloroplast. RNA editing plays an important regulatory role in plant growth, development, stress response, and other physiological and biochemical processes 30 . Identification of RNA editing sites will benefit the study of related biological functions. In our work, potential RNA editing sites were identified in 14-15 protein-coding genes of four Ardisia species. All editing events involved C to U conversion and also caused changes in amino acids 31,32 , while the S → L transformation was the most common. It was similar to the composition characteristics of RNA editing in chloroplast genomes of other plants.
Repeat sequences detected in plastomes have been proven to be correlated with rearrangement, sequence divergence, and recombination 33 . They provide vital information for understanding the evolutionary history and sequence divergence of plant species 34,35 . With the advantages of high polymorphism, stability and repeatability, SSRs have been widely used in genetic diversity analysis, species identification, and molecular breeding [36][37][38] . We detected six types of SSRs in four plastomes of Ardidia species. According to the result, these high variabilities of SSRs may provide strong value and evidence for molecular breeding and identification of medicinal plants.
IR regions are the most conserved regions in the chloroplast genomes 39 . Frequent expansions and contractions at the junctions of SSR and LSC with IRs have been recognized as evolutionary signals for which illustrating the relationships among taxa 40 . It is believed that the contractions and expansions of IR regions in angiosperms www.nature.com/scientificreports/ are generally accompanied by the changes in the length and distribution of ycf1 and rps19 41 . We found that the boundaries between the IRs and LSC/SSC regions of seven Primulaceae species exhibit a similar pattern, in which the ycf1 gene and the rps19 gene were located in the SSC/IRa node and the LSC/IRb node respectively. Studies have shown that mutations in chloroplast genomes can be concentrated and become hotspots for identification and defined as DNA barcoding 42,43 . It is a remarkable fact that single copy regions' mutation rate is significantly higher than reverse repeat regions. Among the four Ardisia chloroplasts, five higher-variable regions (trnT-psbD, ndhF-rpl32, rpl32-ccsA, ccsA-ndhD and ycf1) were detected, which can be selected as the DNA barcoding of Ardisia.
In recent years, the chloroplast genome has become an indispensable tool to investigate phylogenetic development of species, and it could be widely used in phylogenetic reconstruction of plants at different taxonomic levels, such as order, family, genus and species 44,45 . The genetic relationships of Ardisia in Primulaceae are still somewhat uncertain. Phylogenetic relationships of Primulaceae species were inferred based on the available plastomes using ML methods. As indicated by Fig. 8, all branches of the phylogenetic tree are strongly supported. The phylogenetic analyses revealed that four chloroplast genomes presented a close relationship with other  www.nature.com/scientificreports/ reported Ardisias, and different groups of Ardisia could be distinguished from each other (crispardisia group and bladhia group), which clearly indicated that these phylogenetic results are consistent with morphological taxa. Hence, in order to better elucidate the phylogenetic relationships of Primulaceae, more chloroplast genomes are needed to be sequenced.

Conclusions
The complete chloroplast genomes of A. crenata, A. gigantifolia, A. villosa, and A. mamillata were sequenced and characterized. These four plastomes were presented as circular molecular with typical quadripartite structure and shared a similar gene composition and the base content. In the Ardisia chloroplast genomes, mononucleotide SSR and tandem repeats were dominant. Through the analysis of the variable sites, five potential mutation hotspots were found, which laid the foundation for the molecular identification of this genus. Furthermore, four chloroplast genomes presented a close relationship with other reported Ardisia species and confirmed the sister linage with Tapeinosperma. Interestingly, different groups of Ardisia could be distinguished from each other. In a nutshell, these highly variable sites and the complete chloroplast genomes provided sufficient information for contributing to the further study of the molecular evolution and genetic relationship among Primulaceae. Genome sequencing, assembly and annotation. Approximately 3-5 µg of total DNA was sheared into short-insert (350 bp) fragments, followed by library construction. Then, these libraries were evaluated and conducted to genome sequencing with an Illumina HiSeq 4000 platform, generating approximately 5 GB of raw data for each sample.

Methods
Clean reads were retained after filtering the low-quality reads and removing adapters using the Trimmomatic (v0.39, Max Planck Institute of Molecular Plant Physiology, Potsdam, Germany). Cp-like reads were extracted from clean Reads according to sequence similarity using the bwa software (v0.7.17) 46 , then these cp-like reads were assembled using the SPAdes (v3.13.1) to generate contigs. These contigs were plotted against reference plastome to adjust their orientation and gaps were filled with the GapCloser 47 . All clean data was then mapped to the final assembly sequence and visualized by the IGV tool to get an overview of the reads coverage, resulting in complete chloroplast genomes. www.nature.com/scientificreports/ Preliminarily gene annotation of four complete chloroplast genomes was performed by the GeSeq online tool (https:// chlor obox. mpimp-golm. mpg. de/ geseq. html) with default parameters 48 . and further revised manually based on the referential chloroplast genome of A.solanacea (NC_045098.1). Finally, a gene map of the annotated Ardisia chloroplast genome was drew using the OGDRAW tool (http:// ogdraw. mpimp-golm. mpg. de/) 49 . Codon usage and prediction of RNA editing sites. To examine the deviation in synonymous codon usage, the amino acid frequency and the relative synonymous codon usage (RSCU) were analyzed using the Molecular Evolutionary Genetic Analysis (MEGA, version 7) 50 . To predict the possible RNA editing sites in the four Ardisia chloroplast genomes, the online tool of Predictive RNA Editor for Plants (PREP, http:// prep. unl. edu/) 19 was adapted with a cutoff value set as 0.8.

Analysis of repeat elements in four
Ardisias. REPuter (https:// bibis erv2. cebit ec. uni-Biele feld. de/ reput er) 51 was used to identify long repeat sequences with a Hamming distance set as three and a minimum repeat size www.nature.com/scientificreports/ set as 30 bp. Additional, Tandem repeats finder (https:// tandem. bu. edu/ trf/ trf. html) 52 was used to detect tandem repeats and the SSRs in the Ardisia chloroplast genomes were identified using MISA software 53 , with alignment parameters set to 2, 7 and 7 for matches, mismatches and indels.
Comparative analysis. The mVISTA tool (http:// genome. lbl. gov/ vista/ index. shtml) was used to investigate the divergences between the Ardisia complete chloroplast genomes and three referential Primulaceae species in the Shuffle-LAGAN mode. Moreover, IR expansions/contractions were summarized manually 54 . Those selected chloroplast genomes were aligned using the MAFFT (v7.419) software with a default setting and then adjusted manually by Se-Al 2.024 55 . Next, MEGA7 was used to calculate the single nucleotide variants (SNV) and the mean genetic distance between the chloroplast genome sequence of Ardisias. Additionally, DnaSP v5.10 was used to calculate the Pi value and the SNP variation sites of the four Ardisia chloroplast genomes. The step size was set to 200 bp with an 800 bp window length 56 .
Phylogenetic analysis. A phylogenetic analysis was conducted using the complete chloroplast genomes of four Ardisia species and forty-six Primulaceae species, with an outgroup was Rhododendron simsii which belongs to Ericaceae. These complete chloroplast genomes were downloaded from the NCBI database (Table S1) and multi-sequence alignment was performed by the MAFFT program. Finally, the phylogenetic analyses with Maximum likelihood (ML) was conducted using the GTR + G substitution model, which was selected based on model screening. Bootstrap analysis was executed with 1,000 replicates and TBR branch swapping.