Characterization of the chloroplast genome of Gleditsia species and comparative analysis

The genus Gleditsia has significant medicinal and economic value, but information on the chloroplast genomic characteristics of Gleditsia species has been limited. Using Illumina sequencing, we assembled and annotated the complete chloroplast genomes of seven Gleditsia taxa: Gleditsia sinensis, Gleditsia japonica var. delavayi (G. delavayi), G. fera, G. japonica, G. microphylla, Fructus Gleditsiae Abnormalis (Zhū Yá Zào), and a G. microphylla mutant. The assembled genomes have a typical circular tetrad structure, with genome sizes ranging from 162,746 to 170,907 bp. Comparative genomic analysis showed that most (65.8–75.8%) of the abundant simple sequence repeats in Gleditsia and Gymnocladus species were located in the large single copy region. The Gleditsia chloroplast genomes prefer T/A-ending codons and avoid C/G-ending codons. Positive selection was detected on the rpoA, rpl20, atpB, ndhA and ycf4 genes, whereas most chloroplast genes of Gleditsia species underwent purifying selection. Expansion and contraction of the inverted repeat (IR)/single copy (SC) regions showed similar patterns within the genus. Polymorphism analysis revealed that coding regions were more conserved than non-coding regions, and that the IR regions were more conserved than the SC regions. Mutational hotspots were mostly found in intergenic regions such as "rps16-trnQ", "trnT-trnL", "ndhG-ndhI", and "rpl32-trnL". Phylogenetic analysis showed that G. fera is most closely related to G. sinensis, and that G. japonica and G. delavayi are relatively closely related. Zhū Yá Zào can be considered a bud mutation of G. sinensis. The albino phenotype of the G. microphylla mutant is not caused by variation in the chloroplast genome; it may instead be due to mutations in chloroplast-related genes involved in splicing or localization. This study will support further exploration of the genetic evolution and geographical origins of the genus Gleditsia.


Analysis of the chloroplast genome structure of Gleditsia
Visualization of the sequence alignment among multiple chloroplast genomes with GCview showed that the sequences of the different Gleditsia species were similar. SSR identification showed that the total number of SSRs in Gleditsia plastomes ranged from 85 to 133, whereas in Gymnocladus it ranged from 109 to 125. Most (65.8–75.8%) of the SSRs in Gleditsia and Gymnocladus species were located in the large single copy (LSC) region. In Gleditsia, the IR regions contained 2.2–5.5% of the SSR loci and the small single copy (SSC) region contained 17.9–23.3% (Fig. 1a). In the Gymnocladus plastomes sequenced here, 69.9–75.3% of the SSRs were located in the LSC. A total of 89 SSR sites were detected in G. sinensis, comprising 85 mononucleotide and 4 dinucleotide repeats. Mononucleotide repeats were the most abundant SSR type in the genus Gleditsia (Fig. 1b). In addition, 50 repeats were detected in G. sinensis (Fig. 1c), comprising complementary, forward, palindromic, and reverse repeats.
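The SSR counts and regional percentages reported above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the minimum repeat thresholds (`MIN_REPEATS`), the function names, the toy sequence, and the toy region coordinates are all assumptions introduced for illustration, and only perfect mono- and dinucleotide repeats are considered.

```python
import re

# Assumed minimum repeat counts per motif length; the text does not state
# the thresholds that were used, so these values are placeholders.
MIN_REPEATS = {1: 10, 2: 5}

def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for motif_len, min_rep in min_repeats.items():
        # One motif of motif_len bases followed by at least min_rep - 1 copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (motif_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # A homopolymer such as "AA" is a mononucleotide run, not a
            # dinucleotide repeat, so it is skipped here.
            if motif_len > 1 and len(set(motif)) == 1:
                continue
            hits.append((match.start(), motif, len(match.group(0)) // motif_len))
    return sorted(hits)

def region_share(hits, regions):
    """Percentage of SSR start positions falling in each half-open region."""
    counts = {name: 0 for name in regions}
    for start, _motif, _count in hits:
        for name, (lo, hi) in regions.items():
            if lo <= start < hi:
                counts[name] += 1
                break
    total = len(hits)
    return {name: (100.0 * c / total if total else 0.0) for name, c in counts.items()}

# Toy sequence and toy region coordinates, purely for illustration.
toy = "GCGCGC" + "A" * 12 + "GCGC" + "AT" * 6 + "CCG"
ssrs = find_ssrs(toy)
shares = region_share(ssrs, {"LSC": (0, 20), "SSC": (20, len(toy))})
print(ssrs)
print(shares)
```

With real data, `regions` would hold the annotated LSC, SSC, and IR coordinates of each plastome, and the thresholds would be set to match the SSR tool actually used.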
The Gleditsia chloroplast genome sequences were compared with mVISTA, using the chloroplast genome of G. sinensis as the reference. The sequences within the genus were highly similar and conserved; coding regions were more conserved than non-coding regions, and the IR regions were more conserved than the SC regions. The IR/SC junctions of the Gleditsia chloroplast genomes showed similar features (Fig. 2). The IR regions ranged in length from 26,122 to 26,619 bp. The rps3 gene was located in the LSC region in G. sinensis, Zhū Yá Zào, G. fera, G. japonica, G. delavayi, G. microphylla, and the G. microphylla mutant, and all IRs contained the rps19 gene, located 61 to 221 bp from the JLB (the junction between the LSC and IRb). In the sequenced Gleditsia species, the ndhF gene lay entirely within the SSC, away from the junction, and the trnH gene lay entirely within the LSC region. These data suggest that expansion and contraction of the IR/SC regions follow similar patterns within Gleditsia.

Codon bias analysis and selective pressures in evolution
The GC and GC3s contents of the codons in the 9 chloroplast genomes studied were both less than 0.5, indicating a preference for A/T bases and A/T-ending codons in the Gleditsia and Gymnocladus chloroplast genomes. We used the CDSs of the chloroplast genomes to estimate the codon usage frequencies of the seven Gleditsia taxa. All 20 amino acids were encoded in the Gleditsia chloroplast genomes, and the relative synonymous codon usage (RSCU) values were similar among taxa. Of the 29 codons with an RSCU value > 1, only one (TTG) ended in G. The codons with an RSCU value < 1 ended in C or G, except for ATA and CTA, which end in A. Codons ending in C or G therefore showed low usage bias in the Gleditsia chloroplast genomes and were non-preferred.
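To make the RSCU and GC3s quantities concrete, the following Python sketch computes both from a list of in-frame CDS strings. It is an illustration only: the function names and the toy CDS are hypothetical, RSCU is taken as the observed codon count divided by the mean count of its synonymous codon family, and GC3s is assumed to follow the common definition (the G/C fraction at third positions of codons that have synonymous alternatives, excluding Met, Trp, and stop codons).

```python
from collections import Counter
from itertools import product

# Standard codon assignments in TCAG order (the plastid code, NCBI table 11,
# uses the same codon-to-amino-acid assignments).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def _codons(cds_sequences):
    """Yield each complete, unambiguous codon from in-frame CDS strings."""
    for cds in cds_sequences:
        cds = cds.upper()
        for i in range(0, len(cds) - len(cds) % 3, 3):
            codon = cds[i:i + 3]
            if codon in CODON_TABLE:
                yield codon

def rscu(cds_sequences):
    """Observed codon count divided by the mean count of its synonymous family."""
    counts = Counter(_codons(cds_sequences))
    families = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)
    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid absent from the input
        expected = total / len(codons)
        for c in codons:
            values[c] = counts[c] / expected
    return values

def gc3s(cds_sequences):
    """G/C fraction at third positions, excluding Met, Trp and stop codons."""
    gc = total = 0
    for codon in _codons(cds_sequences):
        if CODON_TABLE[codon] in "MW*":
            continue
        total += 1
        if codon[2] in "GC":
            gc += 1
    return gc / total if total else 0.0

# Toy CDS (Met-Leu-Leu-Leu-Trp-stop), purely for illustration.
toy_cds = ["ATGTTATTATTGTGGTAA"]
values = rscu(toy_cds)
print(values["TTA"], values["TTG"], values["ATG"], gc3s(toy_cds))
```

Single-codon amino acids (Met and Trp) always return an RSCU of 1 under this definition, which is why they carry no information about codon bias.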
We analyzed the Ka/Ks ratios of the 76 unique protein-coding genes in the 9 chloroplast genomes (Fig. 3), using G. sinensis as the reference. Five genes (rpoA, rpl20, atpB, ndhA, and ycf4) were identified as being under positive selection (Ka/Ks > 1), whereas the Ka/Ks ratios of most genes were less than 1.
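The Ka/Ks screen amounts to a simple decision rule, sketched below in Python. The function name, the width of the band treated as "close to 1" (`neutral_band`), and the example values are assumptions made for illustration; they are not values from this study.

```python
def classify_selection(ka, ks, neutral_band=0.05):
    """Label a gene by its Ka/Ks ratio.

    neutral_band is an assumed tolerance for "close to 1"; the text does
    not define one.
    """
    if ks == 0:
        # The ratio is undefined when there are no synonymous substitutions.
        return "undefined"
    ratio = ka / ks
    if abs(ratio - 1.0) <= neutral_band:
        return "neutral"
    return "positive" if ratio > 1.0 else "purifying"

# Hypothetical (Ka, Ks) pairs, purely for illustration.
examples = {"geneA": (0.012, 0.006), "geneB": (0.002, 0.020), "geneC": (0.010, 0.010)}
labels = {gene: classify_selection(ka, ks) for gene, (ka, ks) in examples.items()}
print(labels)
```

In practice, genes with Ks equal to zero are usually reported separately rather than forced into one of the three classes, which is why the sketch returns "undefined" for them.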

Discussion
The chloroplast genome generally ranges in size from 120 to 160 kb and exhibits a highly conserved structure 26. The sequencing, assembly, and analysis of chloroplast genomes can identify features that are shared or that differ between species, and the differences can be used as DNA barcodes. The seven Gleditsia species and two Gymnocladus species all have a circular tetrad structure consisting of one LSC region and one SSC region separated by two inverted repeat (IR) regions; the size of the Gleditsia chloroplast genomes ranged from 162,746 to 170,907 bp. Most SSRs are located in intergenic regions 27. SSR identification and mapping on the chloroplast genome showed that mononucleotide repeats were the most abundant SSR type in the genus Gleditsia. The majority of SSRs (65.8–75.8%) in Gleditsia and Gymnocladus species were located in the LSC region, consistent with previous reports on chloroplast SSRs in other plants 28,29.
Codon usage in highly expressed genes has been shaped by selection to maintain the efficiency of global protein translation 30. The RSCU values of the Gleditsia CDSs in the present study were similar. Tryptophan and methionine each had an RSCU value of 1 and were the only amino acids with no codon bias. Of the 29 codons with an RSCU value > 1, only one (TTG) ended in G, whereas the codons with an RSCU value < 1 ended in C or G, except for ATA and CTA, which end in A. Codons ending in C or G thus showed low usage bias in the Gleditsia chloroplast genomes and were non-preferred, and the codons with an RSCU value > 1 were predominantly A/T-ending (Fig. 3b). Similarly, six Euphorbiaceae species 31 and seven Miscanthus species 32 were biased towards A/T bases and A/T-ending codons, and Quercus chloroplast genomes prefer A/T-ending codons and avoid C/G-ending codons 33.
The Ka/Ks ratio reveals the selection pressure on protein-coding genes 34: Ka/Ks ratios > 1, close to 1, or < 1 indicate that a gene has undergone positive selection, neutral evolution, or purifying selection, respectively 35. The Ka/Ks ratios of the majority of genes (74 of 79) were below 1 in the four Carya species, indicating that purifying selection was acting on these genes in C. illinoinensis 36. Most of the CDSs in Chrysosplenium had Ka/Ks ratios ranging from 0.1 to 0.3, implying strong purifying selection 37. In Eruca sativa, the average Ka/Ks ratio was 0.17, indicating that the genes were subject to strong purifying selection 38. Purifying selection constantly removes deleterious mutations from populations, and the purifying selection on most chloroplast genes in Chrysosplenium would be an evolutionary consequence of preserving the adaptive characteristics of Chrysosplenium species 37.
G. microphylla is currently used in food, health care products, and cosmetics, as well as in the treatment of various cancers and of heart, vascular, and infectious diseases 39. The pods of G. japonica are flat and irregularly twisted; G. delavayi is distributed only in Yunnan and Guizhou Provinces, China; G. fera grows on gentle slopes, in mountain valleys and forests, beside villages, near roads, and in sunny places, is occasionally cultivated, and among the species studied can be classified as a fast-growing genotype 5; and the pods of G. australis are conspicuously swollen at the seed attachment sites and have almost no fruit neck 40. G. velutina is endemic to Hunan Province, China, and is a rare and endangered plant under national key protection 41. Comparative transcriptome analysis of the genus Gleditsia showed that stress-related genes had been positively selected during evolution 42.

In this study, five genes (rpoA, rpl20, atpB, ndhA, and ycf4) were identified as being under positive selection (Ka/Ks > 1), whereas the Ka/Ks ratios of most genes were less than 1. Pairwise Ka/Ks analysis thus showed that most chloroplast genes of Gleditsia species underwent purifying selection, which would be an evolutionary consequence of preserving the adaptive characteristics of Gleditsia species.
The IR regions can indicate the distance between species to a certain extent 43. Highly variable regions can provide useful plastid markers for identification, phylogenetic, and population genetic studies 44. The mVISTA analysis of chloroplast genome sequences within the genus Gleditsia showed that coding regions were more conserved than non-coding regions and that IR regions were more conserved than SC regions. Analysis of the IR/SC boundaries indicates that expansion and contraction of the IR/SC regions follow similar patterns within the genus. This is also supported by the polymorphism analysis, in which the IR regions were conserved relative to the SSC and LSC regions, similar to findings in other plants 45.
Mutational hotspots can serve as suitable loci for population genetic and phylogenetic studies, and hypervariable regions are candidates for DNA barcode development 46. DNA barcodes derived from chloroplast genomes will be useful for identifying varieties and resources 12. DNA barcodes with the highest nucleotide diversity are considered the focus of phylogenetic analysis and plant identification 47. Chloroplast gene sequences (ndhF and rpl16) have been used to test biogeographic hypotheses, revealing a fundamental division of the genus Gleditsia into three clades 9. According to sliding window analysis, rps16-trnQ, rpl32-trnL, ndhD-psaC, and ycf1 showed the greatest variation in Ilex 48. In another study, several non-coding sites (psbI-atpA, atpH-atpI, rpoB-petN, psbM-psbD, ndhF-rpl32, and ndhG-ndhI) and three genes (ycf1, ycf2, and accD) showed significant variation 49. Positive selection was observed in 14 protein-coding genes (accD, ccsA, ndhA, ndhB, psbJ, rbcL, rpl20, rpoC1, rpoC2, rps12, rps18, ycf1, ycf2, and ycf4) in nine species of the subfamily Zingiberoideae 50. The Ka/Ks values of three genes (petL, rpl20, and ycf4) were higher than 1 in pairwise comparisons between Galega officinalis and three other Galegeae species 51. In the present study, mutational hotspots were identified among the shared genes and intergenic spacers of the Gleditsia chloroplast genomes. Among the shared protein-coding sequences of Gleditsia, ycf1 and petL were identified as mutational hotspots. ycf1 encodes a protein of unknown function, and petL encodes the 3.5 kDa subunit of the cytochrome b6/f complex 52.
A genetic distance analysis based on ISSR genetic diversity revealed that G. japonica and G. delavayi have a close genetic relationship 55. We performed phylogenetic analysis using both the complete chloroplast genomes and the shared CDSs. The two datasets produced similar phylogenetic trees: the relationships within the genus were consistent and highly supported, differing only in the support values of some nodes. Based on morphology and phylogenetic analysis, G. japonica and G. delavayi appear to be most closely related to each other. Zhū Yá Zào is derived from G. sinensis and is produced by old or injured plants; LC-ELSD analysis found no significant difference in saponin content between Fructus Gleditsiae Abnormalis and Fructus Gleditsiae Sinensis 7,56. Li et al. 8 suggested that Zhū Yá Zào should be regarded as a variant of G. sinensis. In our trees, G. sinensis and Zhū Yá Zào were most closely related and clustered into a single subclade (Fig. 5). Zhū Yá Zào can therefore be considered a bud mutation of G. sinensis.
Albino phenotypes often occur in nature. While raising seedlings in the greenhouse, we obtained an albino mutant plant of G. microphylla (labeled G. microphylla mutant), which was albino throughout the whole plant, was markedly dwarfed, and died naturally after 1–1.5 months of growth. In rice, OsSLC1 is responsible for the seedling-lethal chlorosis phenotype of the seedling-lethal chlorosis 1 mutant; loss of function of OsSLC1 affected the splicing of multiple group II introns and, in particular, prevented the splicing of the rps16 intron 57. The albinism of Camellia sinensis cv. Baiye1 was due to chloroplast dysplasia and a block in the synthesis of Pchlide a from Mg-proto IX 58. Deficient grana stacking in chloroplasts and inhibited expression of genes related to chloroplast localization may also lead to albino seedlings 59. By assembling and comparing the chloroplast genomes of the G. microphylla mutant and G. microphylla, we found that their sequences were completely identical. This suggests that the albino phenotype is not caused by variation in the chloroplast genome and may instead be due to mutations in chloroplast-related genes involved in splicing or localization. This hypothesis requires further experimental validation.

Conclusion
In this study, we sequenced and compared the complete chloroplast genomes of seven Gleditsia genotypes. Assembly and annotation showed that the chloroplast genomes of Gleditsia species have a typical circular tetrad structure and range in size from 162,746 to 170,907 bp. Comparative genomic analysis showed that most (65.8–75.8%) of the SSRs in Gleditsia and Gymnocladus species are located in the LSC. Codons ending in C or G show low usage bias in the Gleditsia chloroplast genomes and are non-preferred; the genus prefers T/A-ending codons and avoids C/G-ending codons. Selection pressure estimates (Ka/Ks ratios) showed that rpoA, rpl20, atpB, ndhA, and ycf4 were subject to positive selection, whereas most chloroplast genes of Gleditsia species underwent purifying selection. The genus Gleditsia faces relatively weak selection pressure. Mutational hotspots mostly occurred in intergenic regions such as "rps16-trnQ", "trnT-trnL", "ndhG-ndhI", and "rpl32-trnL". Phylogenetic analysis showed that G. fera is most closely related to G. sinensis and that G. japonica and G. delavayi are relatively closely related; Zhū Yá Zào can be considered a bud mutation of G. sinensis.

Figure 1. Analysis of SSR sites and repetitive sequences in 9 chloroplast genomes. (a): Distribution of SSRs in the Gleditsia plastomes and two Gymnocladus plastomes; (b): Number of SSR loci of each type; (c): Number of repeats of each type. Note: In (a), different shapes represent the positions of SSRs, and the text shows the proportions; in (c), F: forward repeats, P: palindromic repeats, R: reverse repeats, C: complementary repeats.

Figure 3. Codon bias analysis and selective pressures in evolution. (a): Codon content of the 20 amino acids and stop codons in all protein-coding genes of the Gleditsia chloroplast genome; (b): Distribution of codon preference in Gleditsia; (c): Ka/Ks values of protein-coding genes for the seven pairwise comparisons. Note: In (a), the top panel shows the RSCU for the corresponding amino acids, and the colored blocks below represent the different codons; in (c), Ka: nonsynonymous substitution rate; Ks: synonymous substitution rate.

Figure 4. Nucleotide diversity (Pi) of chloroplast genomes in Gleditsia. (a): Pi in CDSs; (b): Pi in intergenic regions; (c): Pi across the chloroplast genome. Note: G. sinensis was used as the reference genome for comparison; window length: 300 bp; step size: 200 bp; X axis: position of the midpoint of each window; Y axis: Pi of each window.
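The windowed Pi values described in this caption can be approximated with the Python sketch below. It uses the stated window length (300 bp) and step (200 bp) as defaults and reports each window by its midpoint. The function names and the toy alignment are hypothetical, and the sketch simplifies by counting a gap as an ordinary character, whereas dedicated population genetics software typically excludes gapped sites.

```python
from itertools import combinations

def pi_window(aligned, start, end):
    """Mean proportion of differing sites over all sequence pairs in [start, end)."""
    pairs = list(combinations(aligned, 2))
    diffs = 0
    for a, b in pairs:
        diffs += sum(1 for x, y in zip(a[start:end], b[start:end]) if x != y)
    return diffs / (len(pairs) * (end - start))

def sliding_pi(aligned, window=300, step=200):
    """Return (window midpoint, Pi) pairs along an alignment of equal-length strings."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("sequences must be aligned to the same length")
    results = []
    for start in range(0, length - window + 1, step):
        end = start + window
        results.append((start + window // 2, pi_window(aligned, start, end)))
    return results

# Toy alignment with a single variable site, purely for illustration.
toy_alignment = ["AAAAAAAAAA", "AAAAAAAAAA", "AATAAAAAAA"]
profile = sliding_pi(toy_alignment, window=4, step=2)
print(profile)
```

Windows with the highest Pi in such a profile correspond to the candidate mutational hotspots discussed in the text.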

Figure 5. Phylogenetic trees of Gleditsia constructed using the maximum likelihood (ML) method. (a): Phylogenetic analysis based on complete chloroplast genome sequences; (b): Phylogenetic analysis based on shared CDS sequences.