The complete chloroplast genome sequence of the medicinal plant Sophora tonkinensis

Sophora tonkinensis belongs to the genus Sophora of the Fabaceae family. It is mainly distributed in the ridge and peak regions of limestone areas in western China and has high medicinal value and important ecological functions. Wild populations of S. tonkinensis are in danger and need urgent conservation. Furthermore, wild S. tonkinensis resources are very limited relative to market demand, and many adulterants are present on the market. Therefore, a method for authenticating S. tonkinensis and its adulterants at the molecular level is needed. Chloroplast genomes are valuable sources of genetic markers for phylogenetic analyses, genetic diversity evaluation, and plant molecular identification. In this study, we report the complete chloroplast genome of S. tonkinensis. The circular chloroplast genome was 154,644 bp in length, containing an 85,810 bp large single-copy (LSC) region, an 18,321 bp small single-copy (SSC) region and two inverted repeat (IR) regions together spanning 50,513 bp. The S. tonkinensis chloroplast genome comprised 129 genes, including 83 protein-coding genes, 38 transfer RNA (tRNA) genes, and 8 ribosomal RNA (rRNA) genes. The structure, gene order and guanine and cytosine (GC) content of the S. tonkinensis chloroplast genome were similar to those of the Sophora alopecuroides and Sophora flavescens chloroplast genomes. A total of 1,760 simple sequence repeats (SSRs) were identified in the chloroplast genome of S. tonkinensis, and most of them (93.1%) were mononucleotide repeats. Moreover, the identified SSRs were mainly distributed in the LSC region, which accounted for 60% of the total number of SSRs, while 316 (18%) and 383 (22%) were located in the SSC and IR regions, respectively. Only one complete copy of the rpl2 gene was present at the LSC/IRB boundary, while the other copy was absent from the IRA region because of the incomplete gene structure caused by IR region expansion and contraction. The phylogenetic analysis placed S. tonkinensis in Papilionoideae, sister to S. flavescens, and the genera Sophora and Ammopiptanthus were closely related. The complete genome sequencing and comparative chloroplast genome analysis of S. tonkinensis and its closely related species presented in this paper will help formulate effective conservation and management strategies as well as molecular identification approaches for this important medicinal plant.

Scientific Reports | (2020) 10:12473 | https://doi.org/10.1038/s41598-020-69549-z www.nature.com/scientificreports/

and serious habitat destruction (Fig. 1C), and its wild populations have been shrinking severely. However, little is known about its genetic background. The plant chloroplast genome, 110-160 kb in length, is a valuable source of genetic markers for phylogenetic analyses, genetic diversity evaluation, and plant molecular identification due to its conserved structure and comparatively high substitution rate 6,7. Therefore, a good understanding of chloroplast genomic information will facilitate the study of genetic variation in wild populations of S. tonkinensis and the design of reasonable conservation strategies for them. Furthermore, there are many adulterants of S. tonkinensis on the market, and they are difficult to distinguish by outward appearance 8, indicating an urgent need for a molecular approach to differentiate S. tonkinensis from adulterating species. DNA barcode sequence analysis, a molecular identification technology, can provide a rapid, accurate, and automatable method of species identification using a standardized piece of DNA sequence 9-11. Chloroplast non-coding regions have been successfully applied in DNA barcoding research. Yao et al. found that the psbA-trnH intergenic spacer region could be used as a barcode to distinguish various Dendrobium species and to differentiate them from adulterating species 12. Chen et al. tested the discrimination ability of ITS2 in more than 6,600 plant samples belonging to 4,800 species from 753 distinct genera and found that the rate of successful identification with ITS2 was 92.7% at the species level 13. Chloroplast genomic information for S. tonkinensis will provide candidate DNA barcodes for the authentication of S. tonkinensis and the identification of its adulterants.
In the present study, we assembled and analysed the chloroplast genome sequence of S. tonkinensis based on Illumina paired-end (PE) sequencing data. The sequence was also compared with other known chloroplast genome sequences using bioinformatics analysis, and the evolutionary position of S. tonkinensis among the Papilionoideae was confirmed.

Results
Genome sequencing and assembly. In this study, PE DNA sequencing was carried out using the Illumina MiSeq sequencing platform. In total, 17,594,210 × 2 PE reads and 5,313,451,420 bases were obtained, and a nucleotide quality score greater than 20 (Q20) was achieved at a rate of 96.92%. After quality filtering, 16,892,769 × 2 PE reads, 663,584 single reads, and 5,058,544,355 bases were obtained. The assembly results for multiple k-mer values were evaluated comprehensively according to the total length of the assembled sequence, the number of scaffolds and the scaffold N50, and the data for the optimal k-mer were then selected as the final assembly. We obtained 1 scaffold with a length of 154,644 bp. These data indicated a high-quality assembly.

(Table S1). The S. tonkinensis chloroplast genome contained 64 types of codons encoding 21 types of amino acids (Fig. 3). Codon counts ranged from 247 to 2,320, with fractions ranging from 0.08 to 1. The amino acids Met and Trp had only one codon, while the remaining amino acids possessed 2-6 codons.
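The scaffold N50 used above to compare assemblies across k-mer values can be illustrated with a minimal sketch. The function name `n50` and the fragmented example lengths are hypothetical and are not taken from the study; only the single 154,644 bp scaffold comes from the text. N50 is taken here in its usual sense: the length of the scaffold at which the running sum of lengths, sorted from longest to shortest, first reaches half of the total assembly length.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths (in bp)."""
    if not lengths:
        raise ValueError("need at least one scaffold length")
    ordered = sorted(lengths, reverse=True)   # longest scaffold first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:             # half of the assembly is covered
            return length


# The final assembly reported in the text: one scaffold of 154,644 bp.
single_scaffold = n50([154_644])

# A hypothetical, more fragmented assembly of the same total length.
fragmented = n50([80_000, 40_000, 20_000, 10_000, 4_644])

print(single_scaffold, fragmented)
```

For a single scaffold, N50 equals the scaffold length, which is why a one-scaffold result is the best outcome this metric can report for a circular chloroplast genome.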
Comparison to the chloroplast genomes of other Sophora species. The size of the S. tonkinensis chloroplast genome was found to be similar to those of the Sophora alopecuroides and Sophora flavescens chloroplast genomes 14,15 (Table 3). However, the S. tonkinensis chloroplast genome had the longest LSC region (85,809 bp), whereas the S. alopecuroides chloroplast genome had the shortest LSC region (84,221 bp). As shown

4A). Of these, 21 dinucleotide, 55 trinucleotide, 6 tetranucleotide, and 4 pentanucleotide repeats were identified in the LSC region; 3 dinucleotide repeats, 5 trinucleotide repeats, and 1 pentanucleotide repeat were found in the SSC region; and 6 dinucleotide repeats, 19 trinucleotide repeats, and 1 pentanucleotide repeat were observed in the IR region (Fig. 4B-D). The size and location of the tetra- and pentapolymers are shown in Table S2. Of these repeats, 10 and 2 were localized in intergenic spacers and coding regions, respectively, and none were found in introns. Tandem repeat sequences play a crucial role in genome rearrangement and phylogenetic analysis 16. In the current study, a total of 23 tandem repeats were identified in the S. tonkinensis chloroplast genome (Table S3)

of different regions of these plastomes. The overall sequence identities of the four Papilionoideae chloroplast genomes were plotted using mVISTA with the annotation of S. tonkinensis as the reference, and we observed approximately identical gene orders and organizations among them (Fig. 5). The coding regions were found to be more highly conserved than the non-coding regions, and the two IR regions were less divergent than the LSC and SSC regions. The most divergent coding regions of the four chloroplast genomes were ycf1, ndhF, accD, rpoC2, and rpoB, and the four rRNA genes (rrn4.5, rrn5, rrn16, and rrn23) were the most conserved.
IRs are the most conserved regions in the chloroplast genome, and contraction and expansion at their boundaries are common evolutionary events, representing one of the main factors affecting chloroplast genome size. Using Nicotiana tabacum as the reference species, we compared the IR/LSC and IR/SSC borders of the chloroplast genomes of S. tonkinensis, S. alopecuroides, A. mongolicus, and M. floribunda of Papilionoideae (Fig. 6). The results showed that the LSC, SSC and IR regions of S. tonkinensis differed in size from those of the other closely related Papilionoideae chloroplast genomes. In all of these species, the rps19 gene was located in the LSC region. The rpl2 gene of S. tonkinensis spanned the LSC and IRB regions, while the rpl2 genes of the other species were all located in the IRB region, 4-5 bp from the LSC/IRB border. The ycf1 pseudogene spanned the IRB/SSC boundary in all chloroplast genomes, while the ycf1 pseudogene and nadH gene overlapped in A. mongolicus. The nadH gene was present in the SSC region of all genomes, 7-74 bp from the IRB/SSC junction. Expansion and contraction of the ycf1 gene were observed at the SSC/IRA boundary, and the size of ycf1 varied from 5,318 to 5,708 bp among the chloroplast genomes. The trnH gene was found in the LSC region of all genomes, 2-138 bp from the IRA/LSC boundary. In S. tonkinensis, the rpl2 gene was absent from the IRA region because of the incomplete gene structure caused by the expansion and contraction of the IR regions.

Synonymous (KS) and non-synonymous (KA) substitution rate analysis.
A total of 70 genes in the chloroplast genome of S. tonkinensis were used to calculate the KA/KS ratio relative to the chloroplast genomes of S. alopecuroides and S. flavescens (Fig. 7). The KA/KS ratios of most of the genes in S. tonkinensis vs. those in S. flavescens and S. alopecuroides were consistent with negative (or purifying) selection (KA/KS < 1), while six genes (matK, psbE, psbF, psbM, psaI, and rpl36) showed signs of positive selection (KA/KS > 1). Notably, the KA/KS ratios of psbE, psbF, psbM, psaI, and rpl36 in the S. tonkinensis vs. S. flavescens and S. alopecuroides comparisons were as high as 50, which indicated great evolutionary divergence in these genes. The rps2 and rpl32 genes were differentially selected: rps12 did not differ in the S. tonkinensis vs. S. flavescens comparison, but it was positively selected in the S. tonkinensis vs. S. alopecuroides comparison (KA/KS = 9.25). rpl32 exhibited no difference in the S. tonkinensis vs. S. alopecuroides comparison but was negatively selected in the S. tonkinensis vs. S. flavescens comparison (KA/KS = 0.32).
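The interpretation rule applied above (KA/KS < 1 read as purifying selection, KA/KS > 1 as positive selection) can be written down as a small sketch. The helper names `ka_ks_ratio` and `classify_selection` are hypothetical, and this is not the software used in the study; the two example ratios (9.25 and 0.32) are the values quoted in the text. The sketch also makes explicit that the ratio is undefined when KS is zero, a case that any real analysis has to handle separately.

```python
def ka_ks_ratio(ka, ks):
    """Return KA/KS, or None when KS is zero and the ratio is undefined."""
    if ks == 0:
        return None
    return ka / ks


def classify_selection(ratio):
    """Map a KA/KS ratio to the conventional interpretation."""
    if ratio is None:
        return "undefined"
    if ratio > 1:
        return "positive"
    if ratio < 1:
        return "purifying"
    return "neutral"


# Ratios quoted in the text for two pairwise comparisons.
reported = {
    "rps12, S. tonkinensis vs. S. alopecuroides": 9.25,
    "rpl32, S. tonkinensis vs. S. flavescens": 0.32,
}

for comparison, ratio in reported.items():
    print(comparison, ratio, classify_selection(ratio))
```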
Single nucleotide polymorphism (SNP) analysis. SNP loci are very useful resources for phylogenetic analysis and species identification 17. To determine the differences between S. tonkinensis and the two other Sophora species S. alopecuroides and S. flavescens at the chloroplast genome level, SNP analysis was carried out with the chloroplast genome of S. tonkinensis as the reference sequence. The results revealed 805 SNPs in the intergenic regions and 485 SNPs, including 236 non-synonymous and 249 synonymous SNPs, in 64 protein-coding genes. Of these genes, ycf1 contained the most SNP sites (Fig. 8).
Phylogenetic analysis. In the present study, we aligned 20 complete chloroplast genomes of Papilionoideae to reveal the phylogenetic position of S. tonkinensis (Fig. 9). The phylogenetic positions of these 20 chloroplast genomes were successfully resolved with full bootstrap support across almost all nodes. We found that S. tonkinensis grouped within Sophora together with S. flavescens and S. alopecuroides, and that it exhibited the closest relationship with S. flavescens. A close relationship among the genera Sophora, Salweenia and Ammopiptanthus was also uncovered.

Discussion
Since the first plant chloroplast genome was sequenced from tobacco 18, thousands of chloroplast genomes from various species have been sequenced. As of 2019, more than 3,300 chloroplast genome sequences had been recorded in the National Center for Biotechnology Information (NCBI) database. In recent years, DNA barcoding has become a powerful tool for species identification. In plants, commonly used DNA barcodes include the chloroplast regions rbcL, matK and psbA-trnH and the nuclear regions ITS and ITS2 19. Of these, ITS2 has been suggested as a universal DNA barcode for medicinal plants due to its strong identification ability 12

var. xanthioides and A. longiligulare) could be accurately identified using their whole chloroplast genomes 22. Chen et al. discovered that the complete chloroplast genome can be used as a superbarcode to identify six Ligularia species 23. The chloroplast genome could distinguish C. indicum from its closely related species and might become a potential superbarcode for the identification of these species 24. Zhu et al. found that the complete plastome sequence dataset had the highest discriminatory power for D. officinale and its closely related species, indicating that complete plastome sequences can be used to accurately authenticate Dendrobium species 25. The whole chloroplast genome of S. tonkinensis, its hypervariable regions (including the most divergent genes ycf1, ndhF, accD, and rpoC2, which are also the genes containing the most SNP sites) and the six positively selected genes (matK, psbE, psbF, psbM, psaI, and rpl36) could be selected as potential DNA barcodes for species identification in future studies.
Genetic variation plays an important role in the ability of plants to maintain their evolutionary potential to adapt to an ever-changing environment; the maintenance of genetic variation is therefore the main goal of conservation strategies for most endangered species 26. SSRs, also known as microsatellites, have a high polymorphism rate at the species level 27-30. Therefore, they have been widely used as effective molecular markers in population genetic and evolutionary studies 31,32. Yang et al. used eight SSR primers to assess the genetic diversity and structure of 22 natural populations of the endangered medicinal plant Phellodendron amurense in China and proposed appropriate conservation measures for this species 33. For Eucommia ulmoides, an ex situ conservation measure that conserves genetically distant populations to maximize genetic diversity has been recommended, based on an analysis of genetic diversity within and among semi-wild and cultivated populations using two cpSSR loci 34. In the S. tonkinensis chloroplast genome, five types of SSRs (mono-, di-, tri-, tetra-, and pentanucleotide repeats) and a total of 150 SSR loci with a length of at least 10 bp were identified (Table S4). Mononucleotide repeats were the most abundant SSR type. Most of the mononucleotide and dinucleotide repeats were composed of multiple copies of A/T and AT/TA, respectively; this result is similar to that of a previous study on S. alopecuroides 35. These SSRs of the S. tonkinensis chloroplast genome could be useful biomarkers for genetic diversity studies of wild populations of S. tonkinensis, which will help to formulate effective conservation and management strategies for this important medicinal plant.
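To make the notion of an SSR locus "with a length of at least 10 bp" concrete, the following is a minimal sketch of a regular-expression-based SSR scan. It is not the tool used in the study, which is not named in this section. The function names `find_ssrs` and `is_primitive`, the motif range of 1-5 bp, and the rule that the repeat run must span at least 10 bp are assumptions chosen only to mirror the description above; the test sequence is invented.

```python
import re


def is_primitive(motif):
    """True if the motif is not a repetition of a shorter unit (e.g. 'ATAT')."""
    n = len(motif)
    return not any(
        n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n)
    )


def find_ssrs(seq, max_motif=5, min_length=10):
    """Return (start, motif, copies) for perfect repeats of at least min_length bp."""
    seq = seq.upper()
    hits = []
    for k in range(1, max_motif + 1):
        # Smallest copy number whose total length reaches min_length (ceiling division).
        min_copies = max(2, -(-min_length // k))
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_copies - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if is_primitive(motif):  # skip e.g. 'AA' runs already counted as 'A'
                hits.append((match.start(), motif, len(match.group(0)) // k))
    return sorted(hits)


# Invented example: an A run, an AT repeat and a TTC repeat separated by spacers.
example = "GC" + "A" * 12 + "CG" + "AT" * 6 + "GGA" + "TTC" * 4 + "G"
for start, motif, copies in find_ssrs(example):
    print(start, motif, copies)
```

Positions are 0-based. A production analysis would also need to handle compound and imperfect repeats and report which genome region (LSC, SSC or IR) each locus falls in, neither of which is attempted here.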

Conclusions
In conclusion, the chloroplast genome of S. tonkinensis was sequenced on the Illumina HiSeq 2000 platform in this study. SSRs and tandem repeats were identified in the chloroplast genome of S. tonkinensis; 1,760 SSRs were found, most of which were mononucleotide repeats. SSR analysis can provide valuable information for developing highly variable DNA markers for population genetic surveys and other ecological and evolutionary studies of S. tonkinensis. Further, we performed a phylogenetic analysis of 20 chloroplast genomes and a collinearity analysis of three closely related species of S. tonkinensis. The contraction and expansion of the IR regions of the three closely related species were also compared. The results of the above analyses provide valuable reference information that will help formulate effective conservation and management strategies as well as molecular identification approaches for this important medicinal plant.

Genome assembly and annotation. DNA was randomly fragmented by a Covaris M220 apparatus. After adding the poly "A" tail, the DNA fragments with the desired lengths (400-500 bp) were ligated to adapters and purified using the TruSeq™ DNA Sample Prep Kit for Illumina MiSeq sequencing. Before assembly, raw reads were filtered: reads with adapters, reads containing too many uncalled bases ("N" characters, ≥ 10%), reads with a quality score below 20 (Q < 20), and duplicated sequences were removed. The filtered reads were first assembled using SOAPdenovo v2.04 software (https://soap.genomics.org.cn/) 36. Second, GapCloser v1.12 software was used to fill the gaps in the assembly results and for base correction. Annotation of the chloroplast genome was conducted using Dual Organellar GenoMe Annotator (DOGMA) software (https://dogma.ccbb.utexas.edu/) 37, and manual correction was carried out to predict the genes, rRNAs, and tRNAs in the genome. A circular chloroplast genome map was drawn using the OGDRAW program (https://chlorobox.mpimp-golm.mpg.de/OGDraw.html) 38.
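The read-filtering criteria listed above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used in the study: the function names are hypothetical, quality strings are assumed to be Phred+33 encoded, "a quality score below 20" is interpreted as the mean quality of a read, duplicates are detected by exact sequence identity, and adapter removal is left out entirely.

```python
def mean_phred(quality_string, offset=33):
    """Mean Phred score of a read, assuming Phred+33 encoded quality characters."""
    return sum(ord(ch) - offset for ch in quality_string) / len(quality_string)


def keep_read(seq, qual, max_n_fraction=0.10, min_mean_q=20):
    """Apply the N-content and quality criteria to a single read."""
    if not seq:
        return False
    n_fraction = seq.upper().count("N") / len(seq)
    if n_fraction >= max_n_fraction:          # too many uncalled bases (>= 10%)
        return False
    return mean_phred(qual) >= min_mean_q     # assumed reading of "Q < 20"


def filter_reads(reads):
    """Keep reads passing keep_read, dropping exact duplicate sequences."""
    seen = set()
    kept = []
    for name, seq, qual in reads:
        if seq in seen or not keep_read(seq, qual):
            continue
        seen.add(seq)
        kept.append((name, seq, qual))
    return kept


# Invented reads: 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
reads = [
    ("r1", "ACGTACGTAC", "IIIIIIIIII"),   # passes all criteria
    ("r2", "ACGTNNGTAC", "IIIIIIIIII"),   # 20% N, removed
    ("r3", "ACGTACGTAA", "##########"),   # low quality, removed
    ("r4", "ACGTACGTAC", "IIIIIIIIII"),   # duplicate of r1, removed
]
kept_names = [name for name, _, _ in filter_reads(reads)]
print(kept_names)
```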
Codon usage analysis. Relative synonymous codon usage (RSCU) was computed from the protein-coding gene sequences of the S. tonkinensis chloroplast genome. The online program CodonW 1.4.2 (https://codonw.sourceforge.net/) was employed for RSCU and codon frequency analysis 39.
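As an illustration of what an RSCU value measures, the sketch below computes it directly from coding sequences: the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. This is not CodonW and makes no claim to reproduce its output; the function name `rscu` and the toy sequence are hypothetical, and the codon table is built from the standard genetic code ordering, whose codon-to-amino-acid assignments are assumed here to apply to plastid genes.

```python
from collections import Counter
from itertools import product

# Standard genetic code in TCAG order; '*' marks stop codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds_sequences):
    """Return {codon: RSCU} for every codon of each amino acid observed."""
    counts = Counter()
    for cds in cds_sequences:
        cds = cds.upper()
        for i in range(0, len(cds) - len(cds) % 3, 3):
            codon = cds[i:i + 3]
            if codon in CODON_TABLE:          # ignore codons with ambiguous bases
                counts[codon] += 1

    families = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)

    values = {}
    for aa, codons in families.items():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue                          # amino acid not observed at all
        expected = total / len(codons)        # equal usage of synonymous codons
        for c in codons:
            values[c] = counts[c] / expected
    return values


# Toy coding sequence: Met, Ala (GCT), Ala (GCT), Ala (GCC), Trp, stop.
values = rscu(["ATGGCTGCTGCCTGGTAA"])
print(round(values["GCT"], 3), round(values["GCC"], 3), values["ATG"])
```

Because Met and Trp are each encoded by a single codon, their RSCU is always 1 under this definition, whereas values above or below 1 for other codons indicate over- or under-representation relative to equal synonymous usage.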