Complete chloroplast genome of novel Adinandra megaphylla Hu species: molecular structure, comparative and phylogenetic analysis

Adinandra megaphylla Hu is a medicinal plant belonging to the genus Adinandra, which is well known for potential health benefits attributed to its bioactive compounds. This study aimed to assemble and annotate the chloroplast (cp) genome of A. megaphylla and to compare it with previously published cp genomes within the genus. The chloroplast genome was reconstructed using de novo and reference-based assembly of reads generated by long-read sequencing of total genomic DNA. The chloroplast genome was 156,298 bp in size and comprised a large single-copy (LSC) region of 85,688 bp, a small single-copy (SSC) region of 18,424 bp, and a pair of inverted repeats (IRa and IRb) of 26,093 bp each; a total of 51 SSRs and 48 repeat structures were detected. The chloroplast genome includes 131 functional genes: 86 protein-coding genes, 37 transfer RNA genes, and 8 ribosomal RNA genes. Gene content and structure of the A. megaphylla chloroplast genome are highly conserved. Phylogenetic reconstructions using complete cp sequences and the matK and trnL genes from Pentaphylacaceae species resolved the genetic relationships among them; of these markers, matK is the better candidate for phylogenetic resolution. This study is the first report of the chloroplast genome of A. megaphylla.


Results
Chloroplast genome assembly and annotation. Using the PacBio SEQUEL system, 29,815,452 bp of raw whole-genome sequence data were generated from A. megaphylla Hu (Fig. 1). The mean read length was 2938 bp, the N50 contig size was 3594 bp, and approximately 5% of the genomic sequence data belonged to the cp genome, with 188× coverage. The assembly yielded a cp genome of 156,298 bp for A. megaphylla Hu. As in most cp genomes, the assembled A. megaphylla Hu plastome exhibited the typical quadripartite structure comprising four regions: a pair of inverted repeats (IRs, 26,093 bp each), the LSC (85,688 bp), and the SSC (18,424 bp). The cp genome of A. megaphylla Hu contains 131 genes, and its GC content was 37.4% (Table 1).
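The reported region sizes can be cross-checked arithmetically, since the length of a quadripartite plastome equals LSC + SSC + 2 × IR. The following is a minimal Python sketch using only the sizes stated above; the function name is hypothetical and is not part of the assembly pipeline used in this study.

```python
def plastome_length(lsc: int, ssc: int, ir: int) -> int:
    """Total length of a quadripartite plastome: LSC + SSC + two IR copies."""
    return lsc + ssc + 2 * ir


# Region sizes reported for A. megaphylla Hu (bp).
LSC, SSC, IR = 85_688, 18_424, 26_093

total = plastome_length(LSC, SSC, IR)
ir_share = 2 * IR / total  # fraction of the plastome occupied by both IR copies

print(f"total length: {total} bp")
print(f"share in inverted repeats: {ir_share:.1%}")
```

The sum reproduces the assembled genome size, which is a quick sanity check that the reported region boundaries are mutually consistent.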
Repeat sequences and codon analysis. A total of 40 simple sequence repeats (SSRs) were identified in the chloroplast genome of A. megaphylla Hu. All were mononucleotide repeats composed of A or T (10-19 bp in size) (Fig. 2A). No di-, tri-, tetra-, penta-, or hexa-nucleotide SSRs were found in A. megaphylla Hu (Fig. 2B). In addition, 49 repeats were identified in the cp genome of A. megaphylla Hu, consisting of 26 palindromic, 19 forward and 4 reverse repeats. There were no complement repeats (Fig. 3). The smallest repeat unit was 22 bp and the largest was 62 bp. Most repeats (72%) were longer than 30 bp.
The codon usage frequency of 64 protein-coding genes for three Adinandra species was evaluated. The total number of codons in those coding regions was 52,076. G- and C-ending codons were found to be more frequent than their A- and U-ending counterparts (Table 3). Among the 20 amino acids, serine was the most abundant (4975 codons, 9.55%), leucine ranked second (4883 codons, 9.37%), and tryptophan was the rarest (677 codons, approximately 1.3%). Thirty codons were used more frequently than expected at equilibrium (RSCU > 1), and thirty-one codons were used less frequently than expected (RSCU < 1). The codons AUG (methionine, the usual start codon) and UGG (tryptophan), as well as AUA (isoleucine), showed no bias (RSCU = 1).
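For readers unfamiliar with relative synonymous codon usage, RSCU for a codon is its observed count divided by the count expected if all synonymous codons for that amino acid were used equally. The sketch below illustrates the calculation in Python under stated assumptions: it uses the standard genetic code, treats stop codons as one synonymous family, and takes in-frame coding sequences as plain strings. The function name `rscu` and the toy input are hypothetical; this is not the software used in this study.

```python
from collections import Counter

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def rscu(cds_sequences):
    """Return {codon: RSCU} computed over a collection of in-frame CDS strings."""
    counts = Counter()
    for seq in cds_sequences:
        seq = seq.upper().replace("U", "T")
        # Walk the sequence codon by codon, ignoring a trailing partial codon.
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            if codon in CODON_TABLE:  # skips codons with ambiguous bases
                counts[codon] += 1

    # Group codons into synonymous families by encoded amino acid.
    families = {}
    for codon, amino_acid in CODON_TABLE.items():
        families.setdefault(amino_acid, []).append(codon)

    values = {}
    for codons in families.values():
        family_total = sum(counts[c] for c in codons)
        if family_total == 0:
            continue  # amino acid absent from the input
        for c in codons:
            # observed count / (family total / number of synonymous codons)
            values[c] = counts[c] * len(codons) / family_total
    return values


toy = rscu(["ATGTGGGCTGCTGCC"])  # Met, Trp, Ala, Ala, Ala
print(toy["ATG"], toy["TGG"], round(toy["GCT"], 3))
```

Because methionine and tryptophan are each encoded by a single codon, their RSCU is 1 by definition whenever they occur, which is why AUG and UGG show no bias.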
Comparative chloroplast genomic analysis. To characterize genome divergence, the annotation of A. megaphylla Hu was taken as the reference. The comparison revealed that the three chloroplast genomes were highly similar (Fig. 4). The plastome sequences were fairly conserved across the three genomes, with only a few variable regions. Divergence in the LSC and SSC regions was higher than in the IR regions. In addition, sequences in the coding regions tended to be more conserved, whereas most of the detected variation was found in conserved non-coding sequences (CNS). Exon sequences were nearly identical across the three taxa. Among the coding genes, the most divergent regions included matK, rpoC2, ndhK, ndhD and ycf1.
The sliding window analysis showed that the average Pi values of the LSC (Pi = 0.001569) and SSC (Pi = 0.001339) regions were much higher than that of the IR regions (Pi = 0.000219), indicating that the LSC and SSC regions contained most of the variation (Fig. 5). Among the three Adinandra species, the average nucleotide diversity (Pi) was 0.00119.

IR contraction and expansion in the chloroplast genome. The IR and SC boundaries of the three Adinandra species were compared. Overall, the results indicated that the size, organization and gene content of the chloroplast genomes were highly similar among the three species. The IR size ranged from 26,089 bp (A. megaphylla Hu) to 26,095 bp (A. millettii), and the IR of A. angustifolia was 26,092 bp. The ndhF gene was situated within the LSC region with a 5 bp overlap with the IRa in all three Adinandra species. Similarly, the
Phylogenetic inference. As shown in Fig. 7A, the phylogenetic analysis based on matK sequences recovered good resolution among genera. Within the Pentaphylacaceae, Euryodendron and Adinandra angustifolia separated from the other genera. The clade of the genera Ternstroemia and Anneslea was sister to the clade of the genera Adinandra and Eurya. All six Adinandra species grouped in one clade, which was divided into three subclades with 95% support: A. millettii stood alone in one subclade, A. integerrima and A. dumosa formed the second, and the three other species formed the third. These results differed from those of the previous study 15, in which phylogenetic analysis of A. angustifolia, A. millettii, Anneslea fragrans and Ternstroemia gymnanthera inferred from the LSC dataset indicated that they belong to one clade (bootstrap value = 100%). This difference might be due to limited taxon sampling, as only these four species represented the Pentaphylacaceae in Zhang et al.'s study. In contrast with the matK sequence, the trnL region dataset yielded less phylogenetic resolution: the bootstrap value was 59% at the clade of the genus Adinandra (Fig. 7B). Additionally, Adinandra was separated into six subclades: the studied A. megaphylla Hu formed one; A. formosana and A. lasiostyla formed two distinct subclades; and A. hirta and A. glischroloma, A. millettii and A. hainanensis, and A. angustifolia and A. dumosa formed three separate subclades, respectively (Fig. 7B). For barcoding within the Pentaphylacaceae family, the matK sequence is suggested for better phylogenetic resolution.

Discussion
Pentaphylacaceae is a family of flowering plants that contains 12 genera and approximately 345 species worldwide 16. A total of 8 cp genomes in the Pentaphylacaceae family have been published to date, 2 of which belong to Adinandra. The genus Adinandra consists of about 85 species mainly distributed in Bangladesh, Cambodia, China, India, Indonesia, Southern Japan, Laos, Malaysia, Myanmar, New Guinea, Philippines, Sri Lanka, Thailand, and the African tropical forest 17. Many species in the genus Adinandra are of interest because of their bioactive compounds 18-23.
In the present study, we sequenced the whole cp genome of one Vietnamese species, Adinandra megaphylla Hu, and performed comparative analyses on three Adinandra cp genomes to explore the structure of cp genomes in these taxa. Gene organization and codon usage patterns were characterized, and the results indicated high conservation, which can be helpful for phylogenetic and population genetics studies. Angiosperm chloroplast genomes have a highly conserved structure and gene content 24,25. Roughly 129 genes are usually found across angiosperm chloroplast genomes, among which 18 genes include introns. The analyzed Adinandra chloroplast genomes displayed the typical quadripartite structure and showed the expected size range (~156 kb) for angiosperm plants and the conserved gene content 25,26. Our gene annotation results were consistent with the genetic properties of angiosperm chloroplast genomes. The cp genome of A. megaphylla Hu contained 131 genes, 18 of which contained introns.
Apart from the two copies of inverted repeats, 48 small repeats were spread across coding and non-coding regions of the three Adinandra taxa. The number of repeats is not remarkably higher than, but comparable to, that of other taxa (number of dispersed repeats: 49 in Papaver spp.; 21 in Paris spp.; 36 in Passiflora; 37 in Aconitum) 27-29. Repeats are highly associated with plastome reconstruction in several angiosperm taxa and can be considered an indication of recombination 30. Because of their potential to generate secondary structures, repeated sequences can act as recognition signals during the recombination process 31. Recombination is thought to occur rarely in angiosperms because of the predominance of uniparental inheritance. Nevertheless, evidence of intermolecular homologous recombination in flowering plants has been reported 32,33. To date, studies screening for plastome recombination in these taxa are entirely lacking, and no research has demonstrated plastome recombination in Pentaphylacaceae. In this study, the higher number of repeats in comparison with previous estimates might not constitute evidence of inter- and intra-specific plastome recombination.
For constructing phylogenetic relationships of plants, complete chloroplast genomes provide adequate information and have proven effective for classification at lower taxonomic levels 34,35. matK is one of the common DNA barcodes used in plants 36. However, the phylogeny results indicated that relying on a single gene for species classification may give different results depending on the gene chosen. Combining these barcodes can lead to better species identification.

Conclusion
In this study, three complete chloroplast genomes of Adinandra were investigated, including one newly sequenced chloroplast genome (A. megaphylla Hu), which was comparatively analyzed with other published genera in the family Pentaphylacaceae for the first time. We assembled the complete chloroplast genome of A. megaphylla Hu, which is 156,298 bp in length. The structure and gene content of the chloroplast genomes of the three Adinandra species were similar and appeared highly conserved. Finally, phylogenetic relationships were built for species of Pentaphylacaceae by comparing public data with our novel Adinandra sequence. This study demonstrates the potential of chloroplast genome sequences for enhancing species classification and phylogenetic research for in-depth study within Pentaphylacaceae.

Methods
DNA extraction and chloroplast genome sequencing. Genomic DNA was extracted from young plant leaves using a modified CTAB method 37. A260/A280 and A260/A230 ratios were measured with the Shimadzu Biospec Nano to assess DNA sample purity. The concentration of double-stranded DNA was determined with a Qubit 3 Fluorometer and Qubit HS DNA reagents. Genomic DNA integrity was assessed by electrophoresis on a 0.8% agarose gel. DNA libraries were prepared from total genomic DNA using the SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences, Menlo Park, CA), and adapter ligation was subsequently performed following the manufacturer's protocol for genomic DNA above 20 kb (Pacific Biosciences). SMRTbell libraries were loaded on one chip and sequenced on a PacBio SEQUEL system at the Key Laboratory for Gene Technology, Institution of Biotechnology (Hanoi, Vietnam).
Genome assembly and annotation. The total gDNA was sequenced on the PacBio platform by the resequencing method. Sequences derived from the cp genome were identified with the local BLAST program 38 using the Adinandra angustifolia (MF179491) cp genome as the reference 15. Subsequently, the software HGAP4 39 was used to assemble the cp genome. The protein-coding, rRNA, and tRNA genes were annotated with the CpGAVAS pipeline 40. The tRNAscan-SE ver. 1.21 software 41 was applied with default parameters to verify the tRNA genes. The OrganellarGenomeDRAW tool (OGDRAW) ver. 1.3.1 42 was used to create the circular gene map. Repeat elements were found using two approaches. The web-based simple sequence repeat finder MISA-web 43 was used to detect microsatellites, with minimum thresholds of 10 repeat units for mono-, 5 repeat units for di-, 4 repeat units for tri-, and 3 repeat units for tetra-, penta-, and hexa-nucleotide SSRs. Within each SSR type, SSR sizes were compared to count the polymorphic SSRs among the three species. The size and type of repeats in the three Adinandra plastomes were investigated using REPuter 44 with the following parameters: a minimal repeat size of 20 bp, a Hamming distance of 3, and 90% or greater sequence identity.
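The SSR thresholds given above can be illustrated with a short Python sketch that scans a sequence for perfect tandem repeats of 1-6 bp motifs. This is a simplified stand-in written for illustration, not MISA's own algorithm: it reports only perfect, non-compound repeats, and the names `MIN_REPEATS`, `is_primitive` and `find_ssrs` are hypothetical.

```python
# Minimum number of repeat units per motif length, as stated in the text.
MIN_REPEATS = {1: 10, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}


def is_primitive(unit):
    """True if the motif is not itself a repetition of a shorter motif."""
    k = len(unit)
    return not any(k % d == 0 and unit == unit[:d] * (k // d) for d in range(1, k))


def find_ssrs(sequence, min_repeats=MIN_REPEATS):
    """Return (0-based start, motif, repeat count) for each perfect SSR found."""
    seq = sequence.upper()
    hits = []
    i = 0
    while i < len(seq):
        for k, threshold in sorted(min_repeats.items()):
            unit = seq[i:i + k]
            if len(unit) < k or set(unit) - set("ACGT") or not is_primitive(unit):
                continue
            # Count how many consecutive copies of the motif start at position i.
            n = 1
            while seq[i + n * k:i + (n + 1) * k] == unit:
                n += 1
            if n >= threshold:
                hits.append((i, unit, n))
                i += n * k - 1  # jump to the last base of the reported repeat
                break
        i += 1
    return hits


example = "GC" + "A" * 12 + "GCGG" + "AT" * 5 + "C"
print(find_ssrs(example))
```

Skipping non-primitive motifs (for example, treating "AA" as the mononucleotide motif "A") prevents the same run from being reported under several motif lengths.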
Genome comparison. For comparative purposes, we collected two available cp genomes, A. angustifolia (#MF179491) and A. millettii (#MF179492), from GenBank (https://www.ncbi.nlm.nih.gov/genbank/). The overall genome structure, genome size, gene content and repeats across all three Adinandra species were compared 15. The whole plastome sequences of the three Adinandra plants were aligned with the MAFFT server 45 and visualized using LAGAN mode in mVISTA 46. For the mVISTA plot, we used the annotated cp genome of A. megaphylla Hu as the reference. IRscope 47 was employed to visually display and compare the borders of the large single-copy (LSC), small single-copy (SSC), and inverted repeat (IR) regions among the three Adinandra species. We also determined the codon usage bias and the sequence divergence among the three Adinandra species through a sliding window analysis computing Pi among the chloroplast genomes in DnaSP ver. 6.12.03 48. For the sequence divergence analysis, we applied a window size of 600 bp with a 200 bp step size.
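The sliding window calculation can be sketched as follows. This minimal Python example computes nucleotide diversity as the mean number of pairwise differences per site within each window, using the 600 bp window and 200 bp step stated above as defaults. It is an illustration under simplifying assumptions, not a reimplementation of DnaSP: alignment columns containing gaps or ambiguous bases in any sequence are simply skipped, and the function names are hypothetical.

```python
from itertools import combinations


def window_pi(aligned, start, end):
    """Mean pairwise differences per site over alignment columns [start, end)."""
    columns = [
        col
        for col in zip(*(s[start:end].upper() for s in aligned))
        if all(base in "ACGT" for base in col)  # drop gapped or ambiguous columns
    ]
    pairs = list(combinations(range(len(aligned)), 2))
    if not columns or not pairs:
        return 0.0
    diffs = sum(col[i] != col[j] for col in columns for i, j in pairs)
    return diffs / (len(pairs) * len(columns))


def sliding_pi(aligned, window=600, step=200):
    """Return (window midpoint, pi) tuples along an alignment of equal-length strings."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("sequences must be aligned to the same length")
    results = []
    for start in range(0, max(length - window, 0) + 1, step):
        end = min(start + window, length)
        results.append(((start + end) // 2, window_pi(aligned, start, end)))
    return results


# Toy alignment of three sequences with a single variable site.
toy = ["ACGTACGT", "ACGTACGA", "ACGTACGT"]
print(sliding_pi(toy, window=4, step=2))
```

Plotting the returned midpoints against Pi gives the kind of divergence profile used to compare the LSC, SSC and IR regions.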
Phylogenetic identification. The matK and trnL sequences from all Adinandra species and other members of the family Pentaphylacaceae available in GenBank (https://www.ncbi.nlm.nih.gov/genbank/) were used to identify the taxonomic position of the studied A. megaphylla Hu. These sequences were aligned with ClustalW mode in Unipro UGENE software v36.0 49, and a maximum likelihood (ML) 50 phylogenetic tree was then constructed using MEGA X software 51 with 1000 bootstrap replicates. The chosen methods followed the previous study of this genus 15.