The complete chloroplast genome of Stryphnodendron adstringens (Leguminosae - Caesalpinioideae): comparative analysis with related Mimosoid species

Stryphnodendron adstringens is a medicinal plant belonging to the Leguminosae family, and it is commonly found in the southeastern savannas, endemic to the Cerrado biome. The goal of this study was to assemble and annotate the chloroplast genome of S. adstringens and to compare it with previously known genomes of the mimosoid clade within Leguminosae. The chloroplast genome was reconstructed using de novo and referenced-based assembly of paired-end reads generated by shotgun sequencing of total genomic DNA. The size of the S. adstringens chloroplast genome was 162,169 bp. This genome included a large single-copy (LSC) region of 91,045 bp, a small single-copy (SSC) region of 19,014 bp and a pair of inverted repeats (IRa and IRb) of 26,055 bp each. The S. adstringens chloroplast genome contains a total of 111 functional genes, including 77 protein-coding genes, 30 transfer RNA genes, and 4 ribosomal RNA genes. A total of 137 SSRs and 42 repeat structures were identified in S. adstringens chloroplast genome, with the highest proportion in the LSC region. A comparison of the S. adstringens chloroplast genome with those from other mimosoid species indicated that gene content and synteny are highly conserved in the clade. The phylogenetic reconstruction using 73 conserved coding-protein genes from 19 Leguminosae species was supported to be paraphyletic. Furthermore, the noncoding and coding regions with high nucleotide diversity may supply valuable markers for molecular evolutionary and phylogenetic studies at different taxonomic levels in this group.

The chloroplast, which is considered to have originated from free-living cyanobacteria through endosymbiosis, plays an essential role in photosynthesis and in many processes in plant cells [1][2][3] . In this evolutionary context, the chloroplast genome of angiosperms exhibit a highly conserved organization with a quadripartite structure, comprising two copies of inverted repeats (IRs), separated by large (LSC) and small (SSC) single-copy regions 4 .
The size of the circular chloroplast genome range between 120 and 160 kb in length 5 , but varies considerably both within and among plant families. For example, in the Geraniaceae, the size of the chloroplast genome ranges from 116,935 bp in Erodium carvifolium 6 to 242,575 bp in Pelargonium transvaalense (Accession: NC_031206.1 unpublished). For Leguminosae, the size ranges from 120,289 bp in Lathyrus odoratus (Accession: NC_027150.1 unpublished) up to 178,887 in Pithecellobium flexicaule 7 . The variations in size can be attributed mostly to the expansion, contraction or loss of IRs, as well as variation in length of intergenic spacers 5,8 .
Most angiosperm chloroplast genome usually contain 100-130 distinct genes, comprising of 80-90 protein coding genes and approximately 30 transfer RNA (tRNA) genes and four ribosomal RNA (rRNA) genes 9,10 . The IR region comprises a duplicated set of tRNA and rRNA genes, whereas the single copy regions mostly consists of protein-coding genes involved in cell functions, which include components of the photosynthetic machinery (such as photosystem I (PSI), photosystem II (PSII), the cytochrome b6/f complex, and the ATP synthase),

Materials and Methods
DnA extraction and chloroplast genome sequencing. Fresh young leaves of S. adstringens were collected in Niquelândia, Goiás, Brazil (Sisgen Registration: A4EE2BE). Total genomic DNA was extracted using a CTAB protocol 31 . DNA quality was evaluated using horizontal electrophoresis with 1% agarose gel. In addition, DNA was quantified through fluorometry using Qubit 2.0 (Life Technologies). Genomic library preparation was performed using a Nextera DNA Sample Preparation Kit (Illumina, San Diego, USA). The resulted library was sequenced using the HiSeq2500 platform and V4 SBS kit (Illumina) on a single lane in paired-end mode (2 × 100 bp) at the University of São Paulo (Escola Superior de Agricultura Luiz de Queiroz da Universidade de São Paulo) in Piracicaba, Brazil. chloroplast genome assembly and annotation. Paired-end Illumina raw reads were filtered and trimmed using Trimmomatic V.0.36 32  The chloroplast genome of S. adstringens was reconstructed using a combination of de novo and reference-guided assemblies. To obtain the de novo chloroplast genome assembly, the paired-end sequence reads were mapped to five Mimosoid plastomes using Bowtie2 v.2.3.4.1 33 35 using Piptadenia communis Benth. as reference chloroplast genome. Contigs with coverage below than 10x were eliminated. The remaining de novo contigs were merged with reference-guided contigs in Sequencher 5.4.6 (Genecodes, Ann Arbor, Michigan, USA) based on at least 20 bps overlap and 98% similarity. Any discrepancies between de novo and reference-guided contigs were corrected by searching the high quality read pool using the UNIX 'grep' function. A "genome walking" technique, using the Unix "grep" function, was used to find reads that could fill any gaps between contigs that did not assemble in the initial set of analyses. Assembly curation was performed by aligning sequencing reads on the chloroplast using the Bowtie2 program. Sequencing depth was measured using the samtools platform (samtools.sourceforge.net/). Additionally, we also compared the position of the chloroplast genome regions of S. adstringens related species in circle alignment graphs made with the Circus program (http://circos.ca/). Annotation of the chloroplast genome was performed using Verdant 36 and Dual Organellar Genome Annotator-DOGMA 37 , coupled with manual correction of start and stop codons and intron/exon boundaries. Transfer RNA (tRNA) genes were identified with DOGMA and the tRNAscan-SE program ver. 2.0 38 in organellar search mode with default parameters. The circular chloroplast genome map was drawn using OrganellarGenomeDRAW (OGDRAW) 39 . The codon usage analysis was performed in the web server Bioinformatics (https://www.bioinformatics.org/sms2/codon_usage.html). characterization of repeat sequences. The sizes and locations of forward, reverse, palindromic and complementary repeats in the S. adstringens chloroplast genome were determined by REPuter 40 with a minimal size of 30 bp, hamming distance of 3 and over 90% identity. Simple sequence repeats (SSRs) were detected using the microsatellite identification tool MISA (available online: http://pgrc.ipk-gatersleben.de/misa/misa.html). The minimum number of SSRs was set to ten repeat units for mononucleotide, five repeat units for dinucleotide, four repeat units for trinucleotide and three repeat units for tetra-, penta-and hexanucleotide.
nucleotide Diversity and Synonymous (Ks) and non-synonymous (Ka) substitution rate analysis. The complete chloroplast genome sequence of S. adstringens was compared with the chloroplast genome sequences of five Mimosoid chloroplast genomes used for assembling. To assess the complete nucleotide diversity (Pi) among the complete chloroplast genome of the six species, the complete chloroplast genome sequences were aligned using MAFFT aligner tool 41 , and manually adjusted with Bioedit 42 . We then performed a sliding window analysis to calculate the nucleotide variability (Pi) values using DnaSP 6 43 with window lenght of 600 bp and step size of 200 bp. The 77 protein-coding genes were extracted and aligned separately using MAFFT to estimate the synonymous (Ks) and non-synonymous (Ka) substitution rates. The Ka/Ks for each gene were estimated in DnaSP 6. comparative analysis of genome structure. The mVISTA program was applied to compare the complete chloroplast genome of S. adstringens against the whole chloroplast genome of the five mimosoid species using the shuffle-LAGAN mode 44 . The annotated S. adstringens chloroplast genome was used as reference. The expansion and contraction of the IR regions at junction sites between the six mimosoid species were verified and plotted using IRscope 45 . phylogenetic analyses. Seventy-three protein-coding genes were recorded from 19 species within the Leguminoseae -Caesalpinioideae, as well as from two outgroups (Cucumis sativus L. and Fragaria vesca L.). All genes sequences were obtained from GenBank (see Supplementary Table S6 for accession numbers). The accD, ycf1, rps16, rps12 genes were not considered for phylogenetic analysis as they were not present in all chloroplast genomes among the species under analysis.
The nucleotide sequences were aligned using MAFFT 41 with default parameters. The Akaike Information Criterion (AIC) in JModelTest v2.1.10 was used to determine the best-fitting model of molecular evolution for each gene 46 (models selected can be seen in Supplementary Table S7). The alignments from the 73 protein-coding genes were concatenated and a Bayesian inference was performed using BEAST v1.10.1 47 at the XSEDE Teragrid of the CIPRES science gateway 48 (available online: www.phylo.org). The Markov chain Monte Carlo (MCMC) was set to run 50.000.000 generations and sampled every 1.000 generations, under a strict clock approach using the Yule speciation tree prior with the evolutionary models. The Convergence of parameters during MCMC runs were assessed by their Effective Sample Size (ESS) > 200 using TRACER v1.7.1 49 . The phylogenetic tree was annotated as a maximum clade credibility tree using TREEANNOTATOR v.2.5.0 (part of the BEAST package), with burn in of 20%. The final tree was produced using FigTree v.1.4.3 (http://tree.bio.ed.ac.uk/software/figtree/).

Results and Discussion
Genome assembly and annotation. A total of 563,117,260 raw Illumina paired-end reads from the S. adstringens genome were generated and filtered against Mimosoid chloroplast genomes. After trimming adapters, low quality bases, and mapping the reads to the reference set, a total of 10,549,708 reads were used to assemble the chloroplast genome. The filtered reads were assembled into 57 contigs with at least 10x of coverage with a total length of 271,468 bp and an N50 of 10,881. The reference guided assembly produced 64 contigs with an N50 of 3,938 (half of the assembly contained at least 3,938 bp). The final assembly from both approaches resulted in a single chromosome and was benchmarked based on the distribution of sequencing coverage by base. After that, for validation, a set of 18,937,635 raw paired-end Illumina reads were well aligned in the chloroplast genome. The average sequencing depth was 11,470X , with a standard deviation of 3,720 (median: 12,009; mode: 12,448; minimum: 29 and maximum: 72,245) ( Supplementary Fig. 1A,B). The pairwise alignments with species close related to S. adstringens showed a high conservation of the general structure of the chromosome and correct arrangement of the chloroplast regions validating the proposed genome (Supplementary Fig. 2A-D). The sequence of the chloroplast genome was deposited in GenBank (accession number: MN196294).
The complete chloroplast genome of S. adstringens was assembled with a total of 162,169 bp in size, divided in four regions, which included a large single-copy (LSC) region of 91,045 bp, a small single-copy (SSC) region of 19,014 bp, separated by two inverted repeat (IR) regions of 26,055 bp each ( Fig. 1; Table 1). Analogous to most angiosperms, the S. adstringens chloroplast genome comprises a single circular molecule with a quadripartite structure 4 and it is similar in size from other species from the Mimosoid clade (Leguminosae; Table 1; see also 7 ).
The assembled chloroplast genome contained 111 different genes, with 77 protein-coding genes, 30 transfer RNA (tRNA) and 4 ribosomal RNA genes (rRNA) ( Fig. 1; Table 2). A total of nine protein-coding genes and 6 tRNAs genes contained a single intron, whereas three genes (rps12, clpP and ycf3) exhibit two introns each (Supplementary Table S2). The rps12 gene was predicted to be trans-spliced, with the 5′ end located in the LSC region and the duplicated 3′ end in the IR region. The trnK-UUU has the largest intron encompassing the matK gene, with 2,558 bp, whereas the intron of trnL-UAA is the smallest (513 bp).  www.nature.com/scientificreports www.nature.com/scientificreports/ Codon usage analysis performed using 77 protein coding gene sequences identified a total of 20,986 codons in the S. adstringens chloroplast genome (Table 3). Most identified codons are coders for amino acid leucine (2,227 codons, ~ 10.6% of the total number of codons) and the most abundant codon was TTA (33% of codons encoding leucine). The codons encoding the amino acid cysteine were identified as the least abundant in the S. adstringens chloroplast genome (256 codons, ~1.2%). Moreover, only one codon was identified for the coding of methionine (ATG) and tryptophan (TGG) amino acids.
It is important to point out that the structure of the chloroplast genome in species from Leguminosae family are highly variable because of either expansion or contraction of the IR 7,50,51 . This is mainly caused by the gene transfer from single copy regions to inverted repeat and vice versa in the boundaries of IRs, during evolution 51 . With the exception of Acacia ligulata 52 , the legume plastomes of all species documented to date within the clade formed by Ingeae and Acacia species, have IRs ca. 13 kb larger, and a SSC correspondingly smaller, than other legumes 7,50 . In addition, the size variation of the plastid genome has been explained, at least in part, by the loss of one IR. However, the mechanisms that led to IR loss are still unknown. Examples of variation within the Leguminosae include the loss of the IR in the monophyletic group within the subfamily Papilionoideae (including Trifolium, Pisum, Cicer, Medicago, Glycyrrhiza and Vicia), known as the "inverted repeat lacking clade" 53,54 .

Repeat sequences analysis.
A total of 42 repeat structures, with lengths ranging from 30 bp to 128 bp, were detected in the S. adstringens chloroplast genome. They included 22 (52.38%) forward repeats, 18 (42.86%) palindromic repeats, and two (4.76%) reverse repeats (Supplementary Table S3), whereas none complementary structure was identified. The forward repeats ranged from 30 bp to 128 bp. The palindromic repeats were 30 bp to 60 bp, whereas the two reverse repeats were 30 bp and 38 bp (Supplementary Table S3). In the majority of Mimosoid species, the most abundant dispersed repeat identified were forward, then palindromic and the least was reverse 7 .
Among the 42 repeats, 64.29% are located in the LSC region, 14.29% in the SSC region and 16.67% in the IR region. Two repeats, which were 39 bp and 52 bp, located in the intron region of ycf3 and rpl16, was found to repeat thrice (LSC/SSC/IR) and twice (LSC/SSC), respectively, as forward repeat. Most of the repeats (80.95%) were found in the intergenic spacer regions (IGS), whereas 19.05% were located in the introns (ndhA, ycf3 and rpl16) and only two were located in coding region (psaB and trnS-GCU) (Supplementary Table S3).
A total of 137 SSRs were detected in the S. adstringens chloroplast genome, which were composed by a length of at least 10 bp and repeated 3 to 21 times. Among them, 90 (65.69%) were mono-repeats, 19 (13.87%) were di-repeats, nine (6.57%) were tri-repeats, 11 (8.03%) were tetra-repeats, seven (5.11%) were penta-repeats and one (0.73%) was hexa-repeat. The majority (89 or 98.89%) of the mononucleotide repeats consisted of A/T motifs, while only one was composed of a G/C motif. Likewise, most of the dinucleotides and trinucleotides were Category Gene groups Name of genes  AT-rich, being composed of AT/TA (84.21%) and AAT/TTA (77.78%) repeats, respectively (Fig. 2). These results showed that the SSRs exhibit a strong AT bias, which is consistent with the observed in other Leguminosae species, such as Vigna radiata 55 , Cajanus cajan 10 , Cajanus scarabaeoides 10 and Arachis hypogaea 56 . Concerning genomic localization, among the 137 SSRs, 114 (83.21%) were found in the LSC region, 17 (12.41%) in the SSC region and six (4.38%) in the IR region (Supplementary Table S4). Most of these SSRs were located in intergenic regions (102 or 74.45%), while 17 (12.41%) were in the introns and 18 (13.14%) were in the protein-coding genes (Supplementary Table S4). The ycf1 gene contained more SSRs than the other genes (Supplementary Table S4). In addition, the results found herein are in agreement with those from Cajanus cajan 10 , Cajanus scarabaeoides 10 , Vigna radiata 55 and Glycine species 57 . The ycf1 encodes a protein of approximately 1,800 amino acids. The ycf1 located in the IR region is short and conserved, while the located in SSC region are extremely variable in seed plants 58,59 . Some studies reported that this region is the most variable locus for the design of primers, as well, as more variable than matK in many taxa, and thus suitable for molecular systematics at low taxonomic levels 58,60 . nucleotide diversity and Ka/Ks ratio. The average nucleotide variability (Pi) among the five chloroplast genomes of Mimosoid species was estimated to be 0.01771, ranging from 0 to 0.172. The nucleotide variability was higher in the SSC (Pi = 0.02712) and LSC (Pi = 0.02488), when compared to IR regions, which had a much lower nucleotide diversity (Pi = 0.00339). Similar results were obtained in the comparison of chloroplast genome sequences among Aconitum L. species, in which the average Pi in the IR region was 0.00146, whereas in the LSC and SSC regions the diversity estimates were 0.007140 and 0.008368, respectively 61 . Park et al. (2018) analyzed the nucleotide diversity in six Ipomoea L. chloroplast genome and also found that the IR regions were more conserved than the LSC and SSC regions, with average Pi values of 0.003 for IR and more than 0.006 for SSC and LSC regions 62 .
The non-synonymous (Ka) to synonymous (Ks) rate ratio (Ka/Ks) was calculated for the 77 protein-coding genes in common across all six chloroplast genomes ( Fig. 4 and Supplementary Table S5). The synonymous substitution does not change the amino acid within a peptide chain, whereas nonsynonymous substitution does. The rpl32 gene associated with large subunit of ribosomal proteins had the highest synonymous rate, 0.09263, while the ycf1 gene with unknown functions, had the highest nonsynonymous rate, 0.03534.
The Ka/Ks ratio may indicate whether selective pressure is acting on a particular protein-coding gene. A Ka/ Ks > 1 indicates that the gene is affected by positive selection, whereas Ka/Ks < 1 indicates that the gene is affected by negative selection or purifying selection. A value of 0, indicates the presence of neutral selection 64 . Concerning the different regions of chloroplast genomes, the Ka/Ks ratio were highest on average in the SSC region (0.2991) and lowest in the IR region (0.2856) and LSC region (0.2708). The lowest Ka/Ks ratio was observed for genes encoding subunits of ATP synthase, subunits of the cytochrome b/f complex, subunits of the large subunit of ribosomal proteins and subunits of photosystem II ( Fig. 4; Supplementary Table S5).
Herein, the Ka/Ks ratio was calculated to be 0 for 13 genes, two inside the IR region (rpl23, rps12), ten in the LSC region (rpl36, psbM, psbZ, psbJ, psbF, psbT, psbN, petL, petG, atpH) and one in the SSC region (psaC) (Fig. 4). This occurred because the Ka or Ks is 0 or extremely low, thus Ka/Ks ratio could not be calculated 65,66 . Among the 77 protein-coding genes, Ka/Ks indicates purifying selection in 62 of them ( Fig. 4; Supplementary Table S5). The Ka/Ks ration indicates positive selection for three genes analyzed, one of it is associated with small subunit of ribosomal proteins (rps16), the other is associated with Photosystem II (psbH) and the third with the clpP proteases ( Fig. 4; Supplementary Table S5)   In addition, it was demonstrated that Leguminosae chloroplast have regions with accelerated mutation rates, including genic regions such as the clpP in Mimosoids and rps16 in the IRLC clade 50,52,68 . The rps16 gene encodes the ribosomal protein S16 and it is present in the chloroplast genome of the majority of higher plants. Moreover, a multiple gene-loss event of rps16 was reported for various legumes lineages 69,70 . For instance, in the Leguminosae family,  comparative analysis of genome structure. The structural characteristics in chloroplast genome among Mimosoid species revealed that gene coding regions were more conserved than the noncoding regions, and IRs were more conserved than LSC and SSC regions (Fig. 5). This result is consistent with the pattern revealed in other Leguminosae species, such as Glycine species 77 and species from Caesalpinioideae subfamily 7 . Additionality, it was also observed that the intergenic spacers regions between several pairs of genes varied greatly, for example, between matK-rps16, rps16-psbK, trnS-GCU-trnG-UCC, atpH-atpI, trnC-GCA-petN, psbZ-trnG, trnT-UGU-trnL-UAA, ndhC-trnV-UAC and rps3-rps19. Some of these intergenic spacer regions had also the highest level of nucleotide diversity.
The expansion and contraction of the IR region and the single-copy boundary regions result in change in the position of the junction sites, which is considered as a primarily mechanism causing length variation of chloroplast genomes in higher plants 78 . The length of the IR regions was similar, ranging from 26,006 bp in Parkia javanica to 26,143 bp in Dichrostachys cinerea (Fig. 6). Wang et al. (2017) observed that species belonging to tribe Mimoseae had canonical IRs, whereas species belonging to tribes Ingae and Acacieae had much long IRs, as all of them experienced ca. 13 kb IR expansion into SSC previously reported by Dugas et al. (2015).  www.nature.com/scientificreports www.nature.com/scientificreports/ The endpoint of the Mimoseae JLA (Junction between IRa and LSC) was located upstream of the rps19 and downstream of the trnH-GUG. Similar patterns were observed by Amiryousefi et al. (2018b) in Solanaceae chloroplast genomes 79 . The junction between IRb and SSC region (JSB) was located in the intergenic ycf1/ndhF, and the distance between the ndhF end to the junction of the IRb/SSC differs by 11 bp in Dichrostachys cinerea to 150 bp in Adenanthera microsperma. The junction between IRa and SSC (JSA) was located within the ycf1 gene, and the fragment located at the IRa region ranged from 692 bp to 779 bp (Fig. 6). The gene rps19 crossed the LSC/ IRb region and the extent of the IR expansion to rps19 slightly varies among the Mimoseae species ranging from 101 bp to 105 bp. phylogenetic relationships. In this study, 19 species from Caesalpinioideae and two outgroups were analyzed based on 73 protein-coding genes of their chloroplast genomes. The total concatenated alignment length from the 73 protein-coding genes was 60,736 bp. The reconstructed phylogeny indicated that Caesalpinioideae was paraphyletic and that the species from tribe Mimoseae (Adenanthera microsperma, Dichrostachys cinerea, Leucaena trichandra, Parkia javanica, Piptadenia communis and S. adstringens) were deemed non-monophyletic (Fig. 7). All nodes were strongly supported, given the Bayesian posterior probability (Fig. 7). These results are consistent with those from Wang et al. (2017) and support the new classification system proposed for the Leguminosae 21 .

conclusion
In this work, we assemble the complete chloroplast genome of S. adstringens with 162,169 bp. Genome gene contents and orientation are similar to those found in the chloroplast genome of other Mimosoid (Leguminosae) species. This study also revealed the distribution and location of repeated structures and microsatellites along the chloroplast genome of S. adstringens. We also generated important genomic resources for Mimosoid group. Moreover, the Ka/Ks ratio was lower in the LSC region compared to SSC region. As expected, the comparison with other five Mimosoid species revealed that the coding regions are more conserved than non-coding regions, and IRs more conserved than LSC and SSC regions. Finally, the phylogenetic relationships built for 19 species of Caesalpinioideae, including the new data from S. adstringens and two outgroups, were fully resolved with high supports based on 73 conserved protein-coding genes. The maximum credibility tree revealed that the tribe Mimoseae is paraphyletic, consistent with the new classification proposed for the Leguminosae.

Data Availability
The complete chloroplast sequence generated and analyzed during the current study are available in GenBank, https://www.ncbi.nlm.nih.gov (accession numbers are described in the text).