Insights into molecular structure, genome evolution and phylogenetic implication through mitochondrial genome sequence of Gleditsia sinensis

Gleditsia sinensis is an endemic species widely distributed in China with high economic and medicinal value. To explore the genomic evolution and phylogenetic relationships of G. sinensis, the complete mitochondrial (mt) genome of G. sinensis was sequenced and assembled, which was firstly reported in Gleditsia. The mt genome was circular and 594,121 bp in length, including 37 protein-coding genes (PCGs), 19 transfer RNA (tRNA) genes and 3 ribosomal RNA (rRNA) genes. The overall base composition of the G. sinensis mt genome was 27.4% for A, 27.4% for T, 22.6% for G, 22.7% for C. The comparative analysis of PCGs in Fabaceae species showed that most of the ribosomal protein genes and succinate dehydrogenase genes were lost. In addition, we found that the rps4 gene was only lost in G. sinensis, whereas it was retained in other Fabaceae species. The phylogenetic analysis based on shared PCGs of 24 species (22 Fabaceae and 2 Solanaceae) showed that G. sinensis is evolutionarily closer to Senna species. In general, this research will provide valuable information for the evolution of G. sinensis and provide insight into the phylogenetic relationships within the family Fabaceae.

www.nature.com/scientificreports/ G. sinensis, a kind of Fabaceae plant widely distributed throughout China 21 , provide a wide array of benefits. It plays an important role in the conservation and maintenance of soil and water resources due to its drought resistance and low requirements on the soil. In addition, G. Sinensis saponin is effective in decontamination, foaming, which is widely used in the production of cosmetics and detergents with high economic value 22 . The fruits and thorns of G. Sinensis with remarkable antioxidant, anti-tumor, antiviral, antibacterial, and anti-allergic activities 23 , are used as medicinal herbs in China and have been used in the treatment of cancer, carbuncles, skin diseases as well as other diseases 23,24 . However, its mt genome has not been determined, which highly limits the process of molecular research on G. sinensis.
In this study, we assembled the complete mt genome of G. sinensis, which is the first mt genome for Gleditsia. We analyzed its gene content, repeat sequences, codon usage bias, synonymous and nonsynonymous substitution rate. Besides, gene loss and phylogenetic analyses were performed by comparisons with other Fabaceae plants mt genomes. Our data will provide valuable information for studying the evolutionary processes of the G. sinensis mt genome.

Results and discussion
Genome features. The complete mt genome of G. sinensis is 594,121 bp in length with a circular structure ( Fig. 1), and its size is similar to the mt genomes of some Fabaceae plants, such as V. faba (588,000 bp) 25 , L. coriaria (601,574 bp) 26 and T. indica (607,282 bp) 26 . The base composition is as follows: A (27.4%), T (27.4%), C (22.7%), G (22.6%), and GC content is 45.3%. A total of 57 genes were identified in the G. sinensis mt genome, including 37 PCGs, 19 transfer RNA genes and 3 ribosomal RNA genes ( Table 1). As shown in Table 2, the PCGs Figure 1. Genome map of the G. sinensis mt genome. Genes belonging to the functional group are color-coded on the circle as transcribed clockwise (outside) and transcribed counter-clockwise (inside). The darker gray in the inner circle represents the GC content, while the lighter gray represents the AT content. www.nature.com/scientificreports/ in the G. sinensis mt genome account for 5.11% of the entire genome with a total length of 30,336, while noncoding regions account for 90.65% of the entire genome, with a total length of 538,550. The total length of tRNA genes and rRNA genes comprise 0.24% and 0.85% of the entire mt genome, respectively. There exist 12 introns in 37 PCGs, accounting for 3.16% of the genome. Among them, nad2, nad5, ccmFc, rps3 and rps10 contain one intron, and nad4 and nad7 contain three and four introns, respectively (Table 1). Additionally, a protein-coding gene (nad1) and three tRNA genes (trnP, trnD, trnS) were found to contain two copies (Table 1).

Codon usage analysis.
Relative synonymous codon usage (RSCU) refers to the relative probability of a specific codon between the synonymous codons encoding the corresponding amino acid 27 . RSCU = 1 indicates that there is no preference for codon usage, while RSCU > 1 indicates that the codon is a used relatively frequently codon 28,29 . The 37 PCGs of the G. sinensis mt genome contained 10,112 codons (Supplementary  Table S1). Among them, 1057 (10.45%) encoded leucine (Leu) while only 147 (1.45%) encoded cysteine (Cys), which were the most and least used amino acids in the G. sinensis mt genome, respectively (Table S1). The AT content of the first, second, and third codon positions was 52.01%, 56.59% and 61.26%, respectively. The high AT content at the third codon position was similar to other reported higher plants mt genomes 26,27 . Apart from UGG, all preferred synonymous codons (RSCU > 1) end in either A or U (Fig. 2).
Repeat sequences. The angiosperms mt genomes are characterized by repeat sequences, which play an important role in biological evolution, genetic regulation and gene expression 32,33 . SSRs are tandem repeats with 1-6 nucleotides as the basic unit 34 , which are particularly abundant in plant genomes and have an important impact on the function and evolution of the genome 21 . SSRs are generally used as DNA markers for population genetic studies due to the advantages of high polymorphism 35 . In the present study, we identified 71 SSRs with a total length of 718 bp, including 11 dinucleotides (15.49%), two trinucleotides (2.82%), and 58 mononucleotides (84.51%), while tetranucleotides, pentanucleotides, and hexanucleotides were not identified in the mt genome ( Fig. 3A). Among them, the most abundant repeat sequences were mononucleotides, which suggests that mononucleotide repeats may contribute more to genetic variation than other SSRs 36 . Further analysis of the repeat unit of SSRs showed that 80.28% of mononucleotides were A/T, while G/C only accounted for 4.23% (Table S2). The higher AT contents in mononucleotide repeats of G. sinensis mt genome was congruent with other reported Fabaceae plants 37 . The identification of featured SSRs in this study can provide valuable resources on developing markers for phylogenetic research and population studies of G. sinensis. Table 1. Gene annotation of the G. sinensis mt genome. a Genes with one intron, b genes with at least two introns.

Category Group Genes
Mitochondrial respiratory chain related genes  www.nature.com/scientificreports/  www.nature.com/scientificreports/ The sequences with a repeat unit longer than 30 bp were regarded as long repeats, including forward repeats (F), palindromic repeats (P), reverse repeats (R) and complement repeats (C). We identified 50 long repeat sequences in G. sinensis mt genome, ranging from 86 to 270 bp, including 26 forward repeats and 24 palindromic repeats. Most long repeats were 80-119 bp in length, and only 7 repeats were longer than 150 bp (Fig. 3). Repeat sequences, especially long repeats, have important impacts on the structure of plant mt genomes, and they are positively correlated with the size of the genome 12 .
Synonymous and nonsynonymous substitution rate. The calculation of Ka/Ks ratio is important for understanding the dynamics of molecular evolution 38,39 . This ratio can infer whether the PCGs are under selective pressure. Ka/Ks = 1 indicates neutral mutation, Ka/Ks < 1 indicates negative (purifying) selection, and Ka/ Ks > 1 indicates positive (diversifying) selection. In this study, all of the PCGs of the G. sinensis mt genome were used to calculate the Ka/Ks ratios. As shown in Fig. 4, the Ka/Ks ratios of most PCGs were less than 1, indicating that most of the PCGs were under purification selection. These mitochondrial genes that experienced purification selection may play a vital role in stabilizing the normal function of mitochondria 37 . In addition, the Ka/Ks ratios of atp4, atp6, atp8, cox1, matR, nad1, nad4, nad4L, nad5, rps1 were all greater than 1, and almost all of these genes belong to mitochondrial respiratory chain related genes category, indicating that they were under positive selection, which suggests that some advantages had emerged during evolution 37 .
Gene loss. During the evolution of the angiosperm mt genome, the loss of PCGs occurred frequently 40,41 .
In this study, we compared the distribution of PCGs in 22 Fabaceae plant mt genomes (Table S3). As shown in Fig. 5, most PCGs were conserved, especially for mitochondrial respiratory chain related genes, maturase and methyltransferase genes. In contrast, the ribosomal protein and succinate dehydrogenase genes were highly variable. The rpl2, rpl10, rpl14, rps7, rps11, rps19, sdh3, sdh4 genes were lost in most mt genomes, which is understandable because that ribosomal protein and succinate dehydrogenase genes are frequently lost or transferred to the nucleus during the evolution of angiosperm mt genomes (e.g. rps10, rpl2, sdh3, sdh4) 26,41,42 . A total of five genes were lost in the G. sinensis mt genome, including four ribosomal protein genes (rpl10, rps4, rps10, rps19) and one succinate dehydrogenase gene (sdh4). The rps10 gene was only lost in G. sinensis but it was retained in other Caesalpinioideae species. In addition, we found that the rps4 gene was only lost in G. sinensis but it was retained in other Fabaceae species. Interestingly, this gene has not been found lost in other plant mt genomes, yet. Therefore, it is an open question as to whether rps4 was lost for the reason that its function may no longer be needed for G. sinensis, or whether it was functionally transferred to the nucleus 43,44 . Phylogenetic analyses. The higher plant mt genomes evolve slowly, and its mutation rate is significantly low 1,8,10 , which makes it a useful tool for phylogenetic research 45 atp6, ccmB, ccmC, ccmFn, cox1, cox3, matR,  nad2, nad4, nad5, nad6, nad7, nad9, rpl16, rps3, rps4, rps12). The ML and BI trees shared a consistent typology. As shown in Fig. 6, all Fabaceae plants were clustered within a lineage distinct from the outgroup. Most nodes in the ML and BI trees had high support values (bootstrap proportions ≥ 75, posterior probabilities ≥ 0.963), whereas the support value of the clade Detarioideae and Caesalpinioideae was only 53 in the ML tree. The phylogenetic relationship of the four subfamilies was described as (Cercidoideae + (Papilionoideae + (Detarioideae + Caesalpinioideae))). The tree strongly support the separation of Cercidoideae from the clade (Papilionoideae, Detarioideae, Caesalpinioideae), with bootstrap proportions = 100, posterior probabilities = 1, which was consistent with a previous study report 26 . It was worth noting that the G. sinensis and two Senna species were clustered into one clade with a bootstrap support value of 87 and a posterior probability of 1, which indicates www.nature.com/scientificreports/ that G. sinensis were evolutionarily closer to Senna species within the Fabaceae family. The phylogenetic tree constructed in this study could not reflect the true phylogenetic relationship of Fabaceae for the fact that few Fabaceae mt genomes have been sequenced. To illustrate more accurately the evolutionary relationship among Fabaceae species, it is necessary to use more species to analyze the phylogeny.

Mitochondrial genome assembly and annotation.
A total of 14,836,699 raw reads of G. sinensis were produced by Illumina pair-end sequencing, and 14,794,823 clean reads were retained after the quality checking by FastQC. The base quality value Q20 and Q30 were 94.16% and 86.35%, respectively. Subsequent analyses were based on the filtered high-quality sequences. For mt genome assembly, high-quality DNA sequencing reads were mapped to reference mt genome of Senna occidentalis (NCBI accession number NC_038221) using Geneious 46 to get the sequence of cox1, the number of iterations was set to 5 times. Then, the G. sinensis mitochondrial genome was de novo assembled using NOVOPlasty3.7.2 47 , with cox1 sequence set as seed and K-mer length of 39. The N50 and N90 of the obtained contigs were 64428 bp and 18438 bp, respectively. In order to obtain a high-quality mt genome, the base of the genome was corrected based on high-quality DNA sequencing data by using BWA software [48][49][50] , and a total of 96 bases were corrected. Finally, to determine whether the assembled contig is a circular structure, we designed primers based on the base sequence at the head and tail of the contig and performed PCR amplification (Table S4). The results confirmed that the G. sinensis mt genome was a typical circular molecule ( Figure S1). The mt genome was annotated using MITOFY 12 (http:// dogma. ccbb. utexas. edu/ mitofy/) and GeSeq 51 (https:// chlor obox. mpimp-golm. mpg. de/ geseq. html) and was manually checked and adjusted the annotation using Senna occidentalis as the reference sequence. The online tRNAscan-SE search server (http:// lowel ab. ucsc. edu/ tRNAs can-SE) was used to annotate the tRNA gene to determine its position, and the parameter settings were default. The start and stop codons of protein-coding genes were manually adjusted to fit open reading frames. The mt genome of G. sinensis was visualized using OGDRAW 52 . The mt genome of G. sinensis was deposited in NCBI GenBank under accession number MT921986.
Codon usage and substitution rate calculation. The relative synonymous codon usage (RSCU) was calculated by MEGA X 53 . The Ka/Ks ratios were calculated individually on each protein-coding gene of G. sinensis by DnaSP v6 54 , and Acacia ligulata (NCBI accession number NC_040998) was used as an outgroup.

Phylogenetic analyses.
To better infer the phylogenetic relationship within the Fabaceae family, 24species (22 Fabaceae species and 2 outgroups) were selected to construct a phylogenetic tree. We extracted the nucleotide sequences of shared PCGs from these mt genomes. The 17 shared PCGs were aligned individually using PhyloSuite v1.2.1 57 , and the alignment was manually adjusted. All aligned PCGs were then concatenated. www.nature.com/scientificreports/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.