Comparative and phylogenetic analyses of nine complete chloroplast genomes of Orchidaceae

The orchid family has 200,000 species and 700 genera, and it is found worldwide in the tropics and subtropics. In China, there are 1247 species and subspecies of orchids. Orchidaceae is one of the most diverse plant families in the world, known for its lush appearance, remarkable ecological tolerance, and reproductive capability. It has significant ornamental and therapeutic value. In evolutionary terms, the orchid family is one of the more complicated groups, and until now little has been known about its affinities. This study examined the properties of 19 chloroplast (cp) genomes, of which 11 had previously been published and nine are newly reported. We then analyzed selection pressure, codon usage, amino acid frequencies, repeated sequences, and inverted repeat contraction and expansion. The Orchidaceae share similar cp genome characteristics, and we conducted a preliminary analysis of their evolutionary relationships. The cp genome of this family has a typical quadripartite structure and a high degree of consistency across species. Platanthera urceolata has more tandem repeats in its cp genome. According to the phylogenetic analysis, Galearis roborowskyi, Neottianthe cucullata, Neottianthe monophylla, Platanthera urceolata, and Ponerorchis compacta are the closest relatives.


Genome assembly and annotations
Using NOVOPlasty v4.2 14, we assembled the chloroplast genomes, and the more comprehensive results served as the final genome for screening. The newly assembled chloroplast genomes were annotated with PGA (Plastid Genome Annotator) 15, using the cp genome of Dipterocarpus turbinatus (GenBank: NC_046842.1) as a reference, and were then checked manually. The tRNA genes were verified with ARAGORN v1.2.38 16 and tRNAscan-SE v2.0.7 17. A fully annotated circular plastid map was drawn with OGDRAW (OrganellarGenomeDRAW) 18.
Simple sequence repeats (SSRs) were identified in the chloroplast genomes of the 19 plant species using the online MIcroSAtellite (MISA) 2.1 tool. The minimum numbers of repeat units were set to eight for mononucleotides, five for dinucleotides, four for trinucleotides, and three each for tetranucleotides, pentanucleotides, and hexanucleotides.
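The thresholds above can be illustrated with a minimal regular-expression sketch. This is not the MISA implementation; the function name `find_ssrs` and the test sequence are hypothetical, and compound or overlapping SSRs are not handled.

```python
import re

# Minimum repeat counts per motif length, taken from the parameters in the text.
MIN_REPEATS = {1: 8, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}

def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, repeat_count) tuples with 0-based starts."""
    seq = seq.upper()
    hits = []
    for k, n in sorted(min_repeats.items()):
        # A motif of length k followed by at least n - 1 further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, n - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are copies of a shorter unit (e.g. "AA", "ATAT").
            if any(motif == motif[:d] * (k // d) for d in range(1, k) if k % d == 0):
                continue
            hits.append((match.start(), motif, len(match.group(0)) // k))
    return sorted(hits)

# Hypothetical test sequence: an A(10) run and an (AT)6 run.
example = "GGC" + "A" * 10 + "CGG" + "AT" * 6 + "GCC"
print(find_ssrs(example))
```

Each hit reports where the repeat starts, its motif, and how many times the motif is repeated, which is enough to tabulate SSR types per genome.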

Comparative analyses
Taking the Galearis roborowskyi genome as a reference, the sequence identity of the complete chloroplast genomes of the 19 species was visualized with the Shuffle-LAGAN mode of mVISTA v2.0 19 to examine chloroplast genomic differences. The borders of the IR, SSC, and LSC regions were compared using IRscope. Coding and non-coding regions were extracted, and nucleotide diversity (Pi) was calculated with DnaSP v6.12.03 using a window length of 600 bp and a step size of 200 bp.
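The sliding-window calculation of Pi can be sketched as follows. This is a simplified illustration rather than DnaSP's implementation: Pi is taken as the mean proportion of pairwise differences per site, and columns containing gaps or ambiguous bases are dropped (an assumption of this sketch). The function name `sliding_window_pi` is hypothetical.

```python
from itertools import combinations

def sliding_window_pi(aligned_seqs, window=600, step=200):
    """Return (window_midpoint, Pi) pairs for equal-length aligned sequences."""
    length = len(aligned_seqs[0])
    if len(aligned_seqs) < 2 or any(len(s) != length for s in aligned_seqs):
        raise ValueError("need at least two aligned sequences of equal length")
    seqs = [s.upper() for s in aligned_seqs]
    pairs = list(combinations(range(len(seqs)), 2))
    results = []
    for start in range(0, length - window + 1, step):
        # Keep only columns where every sequence has an unambiguous base.
        columns = [col for col in zip(*(s[start:start + window] for s in seqs))
                   if all(base in "ACGT" for base in col)]
        midpoint = start + window // 2
        if not columns:
            results.append((midpoint, None))  # no comparable sites in this window
            continue
        # Total pairwise differences, averaged over pairs and compared sites.
        diffs = sum(col[i] != col[j] for col in columns for i, j in pairs)
        results.append((midpoint, diffs / (len(pairs) * len(columns))))
    return results
```

Plotting Pi against the window midpoint gives the familiar nucleotide-polymorphism profile in which peaks mark candidate mutation hotspots.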
Codon preferences were evaluated using the codon adaptation index (CAI), codon bias index (CBI), frequency of optimal codons (Fop), effective number of codons (ENc), GC content of synonymous third codon positions (GC3s), and relative synonymous codon usage (RSCU) values 20.
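Of these indices, GC3s is the simplest to illustrate. The sketch below assumes the common definition in which Met (ATG), Trp (TGG), and stop codons are excluded because they have no synonymous alternatives; the function name `gc3s` and the example sequence are hypothetical.

```python
# Codons without synonymous alternatives, plus the three stop codons.
EXCLUDED = {"ATG", "TGG", "TAA", "TAG", "TGA"}

def gc3s(cds):
    """Fraction of G or C at the third position of synonymous codons."""
    cds = cds.upper().replace("U", "T")
    usable_length = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable_length, 3)]
    used = [c for c in codons if c not in EXCLUDED and set(c) <= set("ACGT")]
    if not used:
        return None
    return sum(c[2] in "GC" for c in used) / len(used)

# Hypothetical coding sequence: ATG GCT GCC AAA TGG TAA
print(gc3s("ATGGCTGCCAAATGGTAA"))
```

A low GC3s value is consistent with a preference for codons ending in A or U.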

Selective pressure analysis
Selection pressure in orchids was assessed using the ratio of the non-synonymous substitution rate (Ka) to the synonymous substitution rate (Ks). First, the four chloroplast genomes of orchids were compared, and 79 protein-coding genes were extracted. Ka, Ks, and their ratio (Ka/Ks) were then calculated for each gene. A Ka/Ks ratio greater than 1 indicates positive selection, a ratio of 1 indicates neutral evolution, and a ratio less than 1 indicates purifying selection.
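The interpretation rule for Ka/Ks can be written as a small helper. This is a sketch of the decision rule only, not of how Ka and Ks are estimated; the function name `classify_selection`, the tolerance, and the handling of Ks = 0 are assumptions made for illustration.

```python
def classify_selection(ka, ks, tol=1e-9):
    """Return (Ka/Ks, label) using the thresholds described in the text."""
    if ks <= 0:
        # The ratio is undefined when no synonymous substitutions are observed.
        return None, "undefined"
    ratio = ka / ks
    if abs(ratio - 1.0) <= tol:
        return ratio, "neutral"
    if ratio > 1.0:
        return ratio, "positive selection"
    return ratio, "purifying selection"

# Hypothetical Ka and Ks values for three genes.
for ka, ks in [(0.03, 0.01), (0.02, 0.02), (0.001, 0.05)]:
    print(ka, ks, classify_selection(ka, ks))
```

In practice, genes with Ks = 0 are usually reported separately rather than assigned to a selection class.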

Phylogenetic inference
For phylogenetic reconstruction, we collected 10 publicly accessible chloroplast genome sequences from GenBank (two of which served as outgroups). All single-copy genes were extracted from the 19 taxa, and the alignment of each gene was trimmed. The trimmed alignments were then concatenated for phylogenetic analysis. Finally, a maximum likelihood (ML) phylogenetic tree was built using TreeBeST v1.9.2 21. Branch support was evaluated with 100 rapid bootstrap replicates.
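Joining per-gene alignments into a single matrix can be sketched as below. This is an illustrative helper, not the authors' pipeline; the function name `concatenate_alignments`, the taxon labels, and the padding of missing genes with gaps are assumptions.

```python
def concatenate_alignments(gene_alignments, taxa):
    """Concatenate per-gene alignments into one supermatrix.

    gene_alignments maps gene name -> {taxon: aligned sequence}.
    Returns (supermatrix, partitions) with 1-based inclusive partition ranges.
    """
    pieces = {taxon: [] for taxon in taxa}
    partitions = []
    position = 0
    for gene, alignment in gene_alignments.items():
        lengths = {len(seq) for seq in alignment.values()}
        if len(lengths) != 1:
            raise ValueError(f"sequences for {gene} are not aligned to equal length")
        n = lengths.pop()
        for taxon in taxa:
            # Taxa lacking this gene are padded with gap characters.
            pieces[taxon].append(alignment.get(taxon, "-" * n))
        partitions.append((gene, position + 1, position + n))
        position += n
    supermatrix = {taxon: "".join(parts) for taxon, parts in pieces.items()}
    return supermatrix, partitions

# Hypothetical toy alignments for two genes and two taxa.
genes = {"matK": {"sp1": "ATG-", "sp2": "ATGA"}, "rbcL": {"sp1": "CCC"}}
matrix, parts = concatenate_alignments(genes, ["sp1", "sp2"])
print(matrix, parts)
```

The partition ranges record where each gene sits in the concatenated matrix, which is useful if per-gene substitution models are applied later.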

Comparison among chloroplast genomic features in Orchidaceae
Galearis roborowskyi had the smallest chloroplast genome (149,067 bp), whereas Herminium monorchis had the largest (156,412 bp) (Fig. 1) (support document). The LSC, SSC, and IR lengths for these 11 species are displayed in Table 1. We found between 124 and 140 distinct genes, including between 83 and 94 protein-coding genes, 8 rRNA genes, and 38 tRNA genes (Tables 1, 2).
The cp genomes were then compared against that of Galearis roborowskyi using mVISTA. The comparison, which comprised 17 orchid species and two outgroups (Aloe maculata and Aloe vera), revealed a similar sequence pattern throughout the chloroplast genome (Fig. 2, support document). The findings demonstrate that, in contrast to the two outgroups, the chloroplast genomes of the 17 orchid species have comparable architecture and gene sequences. In all studied species, coding regions were more conserved than non-coding regions. The LSC and SSC regions were also more divergent than the IR regions. The alignment revealed that some genes, particularly psbA, matK, and rps16, were less conserved in the genome of Neottianthe monophylla.

Contraction and expansion of inverted repeats
The size of the cp genome can change due to the contraction and expansion of the IR regions, which also affects the rate at which cp genes evolve 22,23. A comparison of the IR boundaries of 28 species of orchids revealed that the boundaries with the most pronounced changes were IRb/SSC, SSC/IRa, and IRa/LSC (Fig. 5). The LSC/IRa and LSC/IRb borders of the Orchidaceae chloroplast genome are substantially conserved, with virtually the same genes flanking them. The rps3 gene is found on IRb at the junction of LSC and IRb. With the exception of Aloe vera, Aloe maculata, and Hancockia uniflora, the rpl22 gene spans the LSC/IRb junction in most species, with Neottia suzukii and Neottia ovata showing the greatest expansion.

Codon usage and amino acid frequency
Relative synonymous codon usage (RSCU), sometimes referred to as codon usage preference, measures how frequently a specific codon is used relative to the other codons that encode the same amino acid. When RSCU > 1, the codon is used more frequently than expected; when RSCU = 1, the codon shows no preference; and when RSCU < 1, the codon is used less frequently than expected (support document).
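The RSCU definition above can be made concrete with a short sketch: for each amino acid, the observed count of a codon is divided by the mean count of all its synonymous codons. The amino acid assignments follow the standard genetic code, which plastid genes share; the function name `rscu` and the example sequence are hypothetical, and stop codons are skipped here as a simplification.

```python
from collections import Counter, defaultdict
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order (third base varies fastest).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def rscu(cds):
    """Return {codon: RSCU} for amino acids observed in a coding sequence."""
    cds = cds.upper().replace("U", "T")
    usable_length = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    counts = Counter(cds[i:i + 3] for i in range(0, usable_length, 3))
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":  # stop codons are left out of this sketch
            families[aa].append(codon)
    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid not present in the sequence
        for c in codons:
            # Observed count divided by the mean count of the synonymous family.
            values[c] = counts[c] * len(codons) / total
    return values

# Hypothetical sequence with four alanine codons: GCT x3 and GCA x1.
print(rscu("GCTGCTGCTGCA"))
```

Here GCT is used three times as often as the family average, so it would count as a preferred codon under the RSCU > 1 criterion.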
To assess codon usage in the Orchidaceae, we analyzed codon usage deviations for genes in the cp genomes of nine orchid species (Tables 3 and 4). Codon usage deviations were derived using relative synonymous codon usage (RSCU). The six codons that encode each of the amino acids arginine (Arg), leucine (Leu), and serine (Ser) were shown to have the highest preference in this investigation. Arginine was determined to have the highest (2.00 to 2.20)

Selective pressure analysis
DnaSP software was used to determine the Ka/Ks of chloroplast genes for 21 distinct species pairs in order to further examine the selection pressure on chloroplast genes in orchids during evolution (Fig. 8).
Ka/Ks ratios were computed for each of the 21 species pairs. Higher Ka/Ks ratios were found for the pairs Herminium monorchis-Platanthera urceolata and Galearis roborowskyi-Neottianthe monophylla. The photosynthesis-related genes atpF, ndhD, ndhE, and ndhH, the expression-related genes rpl22, rpoC1, rpoC2, and rps18, and the other functional genes accD and ycf1 all showed Ka/Ks > 1, indicating that these genes were under positive selection during evolution (Table 5).

Phylogenetic relationships among Orchidaceae
Using Aloe maculata and Aloe vera as outgroups, the evolutionary relationships of the cp genes in 28 orchid species were investigated. A phylogenetic tree based on 50 single-copy genes was constructed using the ML, NJ, and MP methods (Fig. 9). The cp genomes of the 28 orchid species were thus used to infer phylogenetic relationships, and the ML, NJ, and MP analyses were used to compare those relationships with the outgroups Aloe maculata and Aloe vera.

Discussion
Kim et al. compared the chloroplast genome sizes and gene losses of 30 orchids and found that most of the associated genes (ndh) in heterotrophic orchids had been lost, while most of the housekeeping genes were retained 24. This result was also verified in Kim et al. 25. Chloroplast genomes are strongly conserved in plant evolution: they have a typical four-segment structure and are highly conserved in gene content and gene order 26,27, which is why chloroplast genomes are widely used for phylogenetic research. The complete chloroplast genomes of nine Orchidaceae species were examined in this study; they have an average length of 153,330 bp and range in size from 149,067 bp (Galearis roborowskyi) to 156,412 bp (Herminium monorchis). The quadripartite structure of the chloroplast genome in land plants makes it highly conserved under normal conditions. The majority of the 74 protein-coding genes in the angiosperm chloroplast genome are present, but there are also instances of gene capture, gene rearrangement, and gene loss in various families and species 28,29.
Because the structure of the plant cp genome is highly conserved, comparative analysis makes it simple to locate mutation hotspots. In population genetic or phylogenetic investigations, these mutational hotspots surrounded by conserved sequences are frequently used as DNA barcodes 28,47. Through a combined analysis of mVISTA sequence variation and DnaSP-inferred nucleotide variation, we found that sequence variation in Orchidaceae primarily occurs in the non-coding regions. Three variable regions (psbA, matK, and rps16) were identified in the analysis of sequence variation in the cp genome (Fig. 2). Our sliding window analysis showed that trnS-GCU-trnG-UCC, trnT-GGU-psbD, trnI-GAU-rrn16, and rpl2 were all highly variable. More research is required to determine which of these highly variable genes or intergenic spacers could be used as accurate and reliable DNA barcodes in the family Orchidaceae.
We hypothesize that the different chloroplast genome lengths of the nine Orchidaceae species may result from the expansion and contraction of the boundaries between the SC and IR regions 29. By comparing the IR/SC boundary regions of the chloroplast genomes of the nine species, the results further demonstrated the existence of the IR regions, as well as their expansion and contraction. The findings demonstrate that all nine Orchidaceae species exhibit the characteristic quadripartite chloroplast structure, in which two SC regions alternate with two IR regions.
Mutation pressure 30 and selection pressure 31 are the main factors shaping codon preference, which is also affected by other factors such as gene expression level 32, gene length 33, and tRNA abundance 34. Our findings demonstrated that codon usage bias was preserved across species in the Orchidaceae 35, and that changes in codon usage play a significant role in the evolution of the cp genome. Additionally, the majority of codons with RSCU > 1 ended in A/U, indicating that the adaptive evolution of the cp genome may have contributed to some degenerate codon usage bias 36. Moreover, all ENc values are greater than 54.85, and the values for CAI, CBI, and Fop are significantly lower than one, showing that all eleven species have very low codon usage biases. Liu Jiangfeng selected 47 protein-coding sequences from the garlic chloroplast genome of sickle wing, analyzed codon usage patterns, and found that codon preference was affected by selection and mutation, as well as some other influencing factors.
Previous studies showed that polymorphism at SSR loci is useful in studying population genetics 37-39. In this study, the minimum numbers of repeats for motifs of one, two, three, four, five, and six nucleotides were set to 8, 4, 4, 3, 3, and 3 using the MISA software. A total of 255 SSRs, including 226 SSRs made up of A/T, AT/TA, AAT/ATT, AAAT/ATTT, and AATT/AATT, were found in Platanthera urceolata. This confirms that the SSRs in the chloroplast genome are primarily made up of short tandem repeats of A and T, which is similar to the findings published for other plant chloroplast genomes. Previous research has shown that A/T repeat types predominate among all repeat units in many plant chloroplast genomes, and this phenomenon has also contributed to the extremely high AT content in chloroplast genomes 40.
Numerous studies have demonstrated that understanding the adaptive genetic evolution of the chloroplast genome is crucial to understanding changes in gene structure and function 41. The ratio of the non-synonymous substitution rate (Ka) to the synonymous substitution rate (Ks) is commonly used as a measure of selection pressure between species at the sequence level 42. In most genes in an organism, synonymous nucleotide substitutions occur more frequently than non-synonymous changes; as a result, Ka/Ks values are typically lower than one 43. This study's findings indicated that 10 genes had undergone positive selection. Among these genes, rpl22, rpoC1, rpoC2, and rps18 are connected to gene expression. As in most plant research, the photosynthesis-related genes atpF, ndhD, ndhE, and ndhH, as well as the additional functional genes accD and ycf1, all showed Ka/Ks greater than 1 in most species, indicating the presence of positive selection pressure on these genes.
Numerous scientists have conducted in-depth phylogenetic analyses of chloroplast genome sequences in recent years. This is critical to our comprehension of how angiosperms evolved from other organisms 44,45. Complete chloroplast genomes have been used by several researchers to examine the evolutionary relationships and relatedness of plants 46,47. For this study, 17 published orchid chloroplast sequences were chosen in order to better understand the evolutionary links among orchids. The phylogenetic relationships of orchids were modeled using Aloe vera and Aloe maculata as outgroups. The findings suggest that the species were split into two groups. Galearis roborowskyi, Neottianthe cucullata, Neottianthe monophylla, Platanthera urceolata, and Ponerorchis compacta all belonged to the same group, and the tree made it evident how closely related the Orchidaceae species are to one another. This study gives some theoretical support for the detailed investigation of the phylogeny of orchids and demonstrates the effectiveness of the chloroplast genome in resolving the phylogenetic relationships of orchid species. In this paper, nine complete orchid chloroplast genomes are reported.

Conclusions
An examination of the cp genome sequences of 17 species of orchids revealed that their cp genome organization, gene order, codon usage, and repetitive sequence traits are remarkably similar. In general, these structures are conserved, although contraction and expansion of the IR regions were observed and are associated with plastid sequence variation. Analysis of positive selection of genes in the chloroplast genome of this family suggests that atpF, ndhD, ndhE, and ndhH may play a role in the adaptation of most Orchidaceae species to strong-light environments. These genomic data provide new insights into the interspecific relationships of the Orchidaceae species. The phylogenetic analysis of the single-copy genes of the chloroplast genome revealed that the 19 species may be separated into two groups, which offers some theoretical support for a thorough investigation of the phylogeny of the Orchidaceae. This study lays the foundation for further exploration of the taxonomy, phylogeny, and evolutionary history of Orchidaceae.

Figure 1. A genetic map of the Herminium monorchis chloroplast genome. Genes inside the circle are transcribed clockwise, whereas those outside the circle are transcribed counterclockwise. Genes are color-coded by functional group. The guanine-cytosine (GC) content is represented by the dark grey portion of the inner circle, while the adenine-thymine (AT) content is shown by the light grey portion. The inner circle displays the inverted repeat (IRa, IRb) regions as well as the small single-copy (SSC) and large single-copy (LSC) regions.

Figure 2. Comparison of the chloroplast genomes of 23 distinct species using the Shuffle-LAGAN program. The horizontal axis displays the position in the chloroplast genome, and the vertical axis displays the percent identity at a scale of 50 to 100%. Each arrow represents a labeled gene and its direction of transcription. Exons, tRNA, conserved non-coding sequences, and mRNA are designated by different colors as genomic regions (support document).

Figure 3. Changes in chloroplast GC content of all 17 species.

Figure 4. Nucleotide polymorphism of the chloroplast genomes of Orchidaceae species.
The tree was created with 26 nodes. All phylogenetic trees have the same topology (the three trees are presented together), and most of the nodes have 100% bootstrap support, indicating high analysis confidence. The 28 orchid species studied are mainly divided into several large clades, of which Neottianthe, Platanthera, Hetaeria, and Neottia all clearly clustered into one clade, indicating that their congeners are more closely related.

Figure 5. Comparison of the borders of the LSC, SSC, and IR regions among 28 chloroplast genomes of Orchidaceae.

Figure 6. Frequency of different microsatellite motifs in different repeat types of nine Orchidaceae plastomes.

Figure 7. Number of different SSR types in the nine Orchidaceae chloroplast genomes.

Figure 8. Ka/Ks values of different functional genes.

Figure 9. Phylogenetic tree of 28 species based on chloroplast genomes. The topology is indicated with ML/NJ/MP bootstrap support values at each node.

Table 1. Characteristics of the chloroplast genomes of nine species of orchids.

Table 2. Gene differences among the chloroplast genomes of eleven Orchidaceae species.

Table 3. Indices of codon usage bias of the protein-coding genes of Orchidaceae.

Table 4. Codon content of 20 amino acids and stop codons in Neottianthe cucullata.

Table 5. Number of different SSR types in the nine Orchidaceae chloroplast genomes.

roborowskyi Herminium monorchis Malaxis monophyllos Neottia puberula Neottianthe cucullata Neottianthe monophylla Platanthera urceolata Ponerorchis chusua Ponerorchis compacta Total
Data availability
Further questions can be sent to the respective authors, whose original contributions are mentioned in the article/supplementary material. The chloroplast genome sequences of the species mentioned have been uploaded to NCBI with the accession numbers shown below: The sequences of other closely related and outgroup species used in the analysis were downloaded from NCBI with the following accession numbers:
org/10.1038/s41598-023-48043-2 www.nature.com/scientificreports/