Introduction

Scutellaria baicalensis Georgi (S. baicalensis) is a traditional Chinese medicinal plant that belongs to the family Lamiaceae. One of the most commonly used medicinal materials in China, it has been used for more than 2000 years since it was first recorded in the Shen-nong-ben-cao-jing (The Classic of Herbal Medicine)1. To date, it has been included in over 90% of traditional Chinese medicine (TCM) formulas for treating colds2. Flavonoids and their glycosides are the main bioactive compounds of S. baicalensis. The main root-specific 4′-deoxygenated flavonoids are baicalein, baicalin, wogonin, and carboxylase A3,4,5. According to current pharmaceutical investigations, the active compounds of S. baicalensis exhibit significant pharmacological actions such as anti-oxidation, anti-bacterial, anti-viral, anti-tumor, and anti-inflammation activities6,7,8. The plant has been widely used for the treatment of various diseases such as pneumonia, diarrhea, infections, colitis, and hepatitis9,10,11,12,13.

Recently, coronavirus disease 2019 (COVID-19) spread rapidly worldwide14. S. baicalensis has recently been reported to have significant curative effects in the treatment of COVID-1915,16. This has led to the large-scale cultivation of S. baicalensis in China, and researchers are actively breeding new varieties. This study used wild S. baicalensis resources from Dingxi City as raw materials. Candidate plants were screened by the single plant optimization method according to agronomic characters, susceptibility, and drug grade indicators. We then selected three new cultivated varieties of S. baicalensis with high quality and yield through strain identification, strain comparison, multi-point tests, and regional production tests. However, it is not clear which of these varieties can be widely used. The three cultivated varieties differ morphologically mainly in flower color (Fig. 1), namely white (SBW), rose (SBR), and purple (SBP), so it was impossible to distinguish them before flowering. In addition, S. baicalensis typically has purple flowers, whereas the rare white and rose flower phenotypes obtained in cultivation show great ornamental potential. Accurate identification of varieties also provides a basis for homozygous breeding.

Figure 1

The phenotype of three cultivated varieties (a: SBW, b:SBR, c: SBP).

The chloroplast (cp) is an essential organelle that plays a crucial role in plant photosynthesis and several critical biochemical processes17. Owing to its slow mutation rate, abundance within plants, relatively small genome size, and haploid inheritance18, cp DNA has been widely used in many research fields, such as taxonomic revision, systematic evolution, and species identification19,20. Moreover, complete cp genome sequences have been suggested as super barcodes for the identification of plants21,22.

The complete cp genomes of the three cultivated varieties were sequenced and annotated in this study. To determine their internal differences, we examined their general characteristics and compared their sequences. Moreover, we explored their phylogenetic position to decipher the genetic relationships among the three cultivated varieties and to provide a basis for variety breeding. The results of this study provide abundant genetic information on S. baicalensis and serve as a theoretical basis for expanding its medicinal resources.

Materials and methods

Plant materials

Fresh, healthy leaf tissues of the three cultivated varieties of S. baicalensis (SBW, SBR, and SBP) were collected from the Germplasm Resources Nursery of Dingxi Academy of Agricultural Sciences (Dingxi, China, 35°6′38″N, 118°21′48″E) (Fig. 1). The specimens were identified by Professor Zhirong Sun following the taxonomic key and external morphology diagnosis given in Flora Reipublicae Popularis Sinicae. The voucher specimens were deposited at the Herbal Medicine Library of the School of Chinese Materia Medica, Beijing University of Chinese Medicine.

DNA extraction and sequencing

The fresh leaves of the three cultivated varieties were frozen in liquid nitrogen and stored in a −80 °C refrigerator until DNA extraction. DNA was extracted using the Plant Genomic DNA Kit (Tiangen Biotech, Beijing) following the manufacturer’s instructions. Around 20–30 mg of dried tissue or 50–60 mg of frozen tissue was used in each extraction. After DNA isolation, 1 μg of purified DNA was fragmented and used to construct short-insert libraries (insert size 300–500 bp) according to the manufacturer’s instructions (Illumina HiSeq X-Ten) for sequencing.

Cp genome assembly and annotation

The high-quality reads were assembled using GetOrganelle v.4.0 and then annotated with CPGAVAS223. The annotations of tRNA genes were confirmed using tRNAscan-SE v.2.0324. Bowtie2 and SAMtools were used to map the reads back to the assembled genomes and to evaluate the quality of the assemblies. The cp genomes of SBP, SBR, and SBW were submitted to GenBank at the National Center for Biotechnology Information (NCBI) under the accession numbers OP837955, OP837956, and OP837957, respectively. Fully annotated circular plastome maps were drawn with the Chloroplot web tool (https://irscope.shinyapps.io/Chloroplot/).
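The read-mapping check described above can be illustrated with a minimal Python sketch. It assumes the mapped reads have already been summarized with `samtools depth -a`, which writes one tab-separated line per position (reference name, 1-based position, depth). The function names and the toy input are hypothetical and are not taken from the study.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_depth(lines: Iterable[str]) -> Iterator[Tuple[str, int, int]]:
    """Yield (reference, position, depth) from `samtools depth -a` style lines."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        yield fields[0], int(fields[1]), int(fields[2])


def zero_coverage_positions(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Return every (reference, position) that no read covers."""
    return [(ref, pos) for ref, pos, depth in parse_depth(lines) if depth == 0]


def mean_depth(lines: Iterable[str]) -> float:
    """Average depth over all reported positions (0.0 for empty input)."""
    depths = [depth for _, _, depth in parse_depth(lines)]
    return sum(depths) / len(depths) if depths else 0.0


# Toy input standing in for a real depth file (values are invented).
example = ["cp\t1\t50", "cp\t2\t0", "cp\t3\t48"]
print(zero_coverage_positions(example))
print(mean_depth(example))
```

An assembly with an empty list of zero-coverage positions and a roughly uniform depth profile would be consistent with the "even and not zero" criterion used later in the Results.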

Codon usage

The protein-coding genes were extracted with PhyloSuite v.1.2.225. Relative synonymous codon usage (RSCU) and codon usage values were calculated with CodonW v.1.4.2, and the RSCU values were displayed as a heatmap using TBtools26.
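For readers unfamiliar with RSCU, the quantity is the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. The sketch below is a minimal illustration of that definition in Python, not a reimplementation of CodonW. It assumes in-frame DNA coding sequences, uses the standard amino acid assignments, and skips stop codons; all names are hypothetical.

```python
from collections import Counter
from itertools import product
from typing import Dict, Iterable

# Standard genetic code in TCAG order (first base varies slowest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds_sequences: Iterable[str]) -> Dict[str, float]:
    """Compute RSCU per codon from in-frame coding sequences."""
    counts: Counter = Counter()
    for seq in cds_sequences:
        seq = seq.upper().replace("U", "T")
        usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
        for i in range(0, usable, 3):
            codon = seq[i:i + 3]
            if codon in CODON_TABLE:  # skips codons with ambiguous bases
                counts[codon] += 1

    # Group codons into synonymous families by amino acid.
    families: Dict[str, list] = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)

    values: Dict[str, float] = {}
    for aa, codons in families.items():
        total = sum(counts[c] for c in codons)
        if aa == "*" or total == 0:
            continue  # assumption: stop codons and unused amino acids are skipped
        for c in codons:
            # observed count / (family total / family size)
            values[c] = counts[c] * len(codons) / total
    return values


# Toy coding sequence: three TTA and one CTG codon, all encoding leucine.
print(rscu(["TTATTATTACTG"])["TTA"])
```

A value above 1 means the codon is used more often than expected under equal usage, which is the sense in which the Results describe A/T-ending codons as preferred.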

Repeat analysis and comparative analyses

Repetitive sequence analyses were performed with CPGAVAS2. Tandem repeats were identified with Tandem Repeats Finder27 using default settings. Misa.pl was used to screen for simple sequence repeats (SSRs)28. Scattered repetitive sequences were identified with VMATCH. REPuter was used to determine the size and location of the oligonucleotide repeats (ORs)29. The complete cp genomes of the three cultivated varieties were compared using mVISTA30, with the genome of S. baicalensis (NC027262) as the reference sequence for annotation. Sliding window analysis was conducted in DnaSP v6 to assess the nucleotide diversity (Pi) of the cp genomes (window length = 300 bp, step size = 25 bp). IRscope31 was used to analyze the contraction and expansion of the inverted repeats at the junctions of the cp genomes.
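The sliding window calculation can be sketched in a few lines of Python. This is an illustration of the general definition of Pi (the mean proportion of differing sites over all pairs of aligned sequences) with the window length and step size stated above; it is not DnaSP itself. Skipping alignment columns that contain gaps or ambiguous bases is an assumption, and the function names are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def nucleotide_diversity(seqs: Sequence[str]) -> float:
    """Mean pairwise proportion of differing sites in an alignment block.

    Columns containing anything other than A, C, G, or T are skipped
    (an assumption made for this sketch).
    """
    seqs = [s.upper() for s in seqs]
    columns = [col for col in zip(*seqs) if all(b in "ACGT" for b in col)]
    pairs = list(combinations(range(len(seqs)), 2))
    if not columns or not pairs:
        return 0.0
    diffs = sum(1 for col in columns for i, j in pairs if col[i] != col[j])
    return diffs / (len(pairs) * len(columns))


def sliding_pi(seqs: Sequence[str], window: int = 300,
               step: int = 25) -> List[Tuple[int, int, float]]:
    """Return (start, end, Pi) per window, with 1-based inclusive coordinates."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    results = []
    for start in range(0, length - window + 1, step):
        block = [s[start:start + window] for s in seqs]
        results.append((start + 1, start + window, nucleotide_diversity(block)))
    return results


# Toy alignment of three sequences with one variable site.
toy = ["AAAA", "AAAT", "AAAA"]
print(sliding_pi(toy, window=2, step=1))
```

Under this definition, with three sequences and a gap-free 300 bp window, each site at which one sequence differs from the other two adds 2/(3 × 300) ≈ 0.0022 to Pi, so window values of about 0.0067 and 0.0089 would correspond to roughly three and four such sites.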

Identification and validation of barcode for species discrimination

Based on the DnaSP results, we chose the highly variable region to distinguish the three cultivated varieties. Primers to discriminate between the three cultivated varieties were designed on the variable intergenic regions using SnapGene 6.2.1 (SnapGene from Insightful Science, available at http://www.snapgene.com, last used in 2023). PCR amplifications were performed in a final volume of 20 μL containing 10 μL of 2 × Taq PCR Master Mix, 0.5 μM of each primer, 5 μL of template DNA, and 4 μL of ddH2O, following the manufacturer’s instructions (Mei5 Biotechnology, Co., Ltd). All amplifications were carried out in a Pro-Flex PCR system (Applied Biosystems, Waltham, MA, USA) under the following conditions: denaturation at 95 °C for 3 min, followed by 36 cycles of 94 °C for 25 s and 55 °C for 10 s, and a final extension at 72 °C for 2 min. PCR amplicons were visualized on 1% agarose gels, purified, and then subjected to bidirectional Sanger sequencing on an ABI 3730 XL instrument (Applied Biosystems, USA) using the same primers as for PCR amplification and BigDye v3.1 chemistry (Applied Biosystems), following the manufacturer’s instructions. All amplifications were repeated twice for each variety.

Phylogenetic analysis and divergence times analysis

Phylogenetic analysis was performed on 21 complete cp genomes, comprising the three sequences assembled in our study, 16 cp genomes downloaded from the NCBI (12 Scutellaria, 1 Pogostemon, 1 Ajuga, 1 Lavandula, and 1 Ocimum), and Tulipa gesneriana (NC063831) and Aloe vera (NC035506) as outgroups. A total of 86 shared protein-coding genes were extracted, concatenated, and aligned using MAFFT v7.30732. Subsequently, a Bayesian inference (BI) tree was reconstructed from the alignment in MrBayes using the GTR + I + G evolution model33. The analysis was run for five million generations and sampled every 1000 generations, with all other settings left at their defaults, and the first 25% of each run was discarded as burn-in. A maximum likelihood (ML) tree with 1000 bootstrap replicates was also built from the alignment using RAxML34 with the following parameters: raxmlHPC-PTHREADS-SSE3 -f a -N 1000 -m GTRGAMMA -x 551314260 -p 551314260 -o Fritillaria_cirrhosa_NC_024728,Fritillaria_thunbergii_NC_034368 -T 20, that is, 1000 replications and best-fit model selection. In addition, Modeltest was used to determine the most appropriate model of DNA sequence evolution for the combined 87-gene dataset. Moreover, MrBayes was run for 5,000,000 generations, sampling and printing every 500 generations. Two independent MCMC runs using four chains (with the default heating schedule) were conducted per Bayesian analysis. Branch support was calculated from the posterior distribution of Bayesian trees after discarding the first 25% of the trees as burn-in, and from 1000 ML bootstrap pseudoreplicates.
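To make the MCMC settings concrete, the following Python sketch works out how many sampled trees remain after burn-in for the two sampling intervals mentioned above (every 1000 and every 500 generations). It assumes that the starting state at generation 0 is also sampled and that the burn-in count is rounded down; both are assumptions of this sketch, and the function name is hypothetical.

```python
from typing import Tuple


def retained_samples(generations: int, sample_every: int,
                     burnin_fraction: float) -> Tuple[int, int, int]:
    """Return (total samples, discarded as burn-in, retained) for one MCMC run."""
    # Assumption: generation 0 is sampled as well, hence the +1.
    total = generations // sample_every + 1
    # Assumption: the burn-in count is rounded down to a whole sample.
    burnin = int(total * burnin_fraction)
    return total, burnin, total - burnin


# Five million generations with a 25% burn-in, for both sampling intervals.
for interval in (1000, 500):
    print(interval, retained_samples(5_000_000, interval, 0.25))
```

Halving the sampling interval roughly doubles the number of retained trees per run but does not change the length of the chain, so it affects file size and autocorrelation between samples rather than how far the chain has run.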

We used MEGA35 for molecular clock analysis of the alignment of shared cp protein-coding genes, using fossil information for Arabidopsis thaliana (53–82 million years ago, Mya), Oryza sativa (148–173 Mya), and the family Labiatae (49 Mya)36,37,38. In addition, another molecular clock tree was constructed from an ML tree using BEAST39. Phylogenetic inference by MCMC analysis was performed with default settings (20,000,000 generations, a Yule speciation tree prior, and trees sampled every 1000 generations) under a strict clock approach. TRACER was used to check acceptability and convergence to the stationary distribution of trees40, and TREEANNOTATOR was used to generate the maximum clade credibility tree from the obtained trees after setting a burn-in of 10%41. The tree was visualized with FigTree (v. 1.4.4; http://tree.bio.ed.ac.uk/software/figtree/).

Results and discussion

Characteristics of three cultivated varieties

Read coverage across the cp genomes of the three cultivated varieties was even and never zero (Fig. S1), indicating that the assemblies were correct and that there was no heteroplasmy. The size and content of these genomes are summarized in Table 1. The cp genome of SBW (151,702 bp) was the smallest, and that of SBP (151,876 bp) was the largest. All three cp genomes exhibited a typical quadripartite structure (Fig. 2), with two inverted repeat regions (IRa and IRb, 25,261–25,265 bp) separated by a large single-copy (LSC, 83,878–84,025 bp) region and a small single-copy (SSC, 17,294–17,330 bp) region. The cp genomes had identical gene content and gene types, and the genes were broadly classified into self-replication, photosynthesis, and other genes (Table 2). Each genome contained a total of 129 genes, including 85 protein-coding genes, 36 tRNA genes, and 8 rRNA genes, identical to the other members of the genus Scutellaria. Compared with most angiosperms, psbH, rpoA, chlB, chlL, and ycf68 were lost during evolution. The total GC content of the three cp genomes was 38.33% but was unevenly distributed among regions (Table 1): the GC content of the IR regions (43.61%) was higher than that of the LSC (36.32–36.33%) and SSC (32.61–32.66%) regions. Overall, the GC content was lower than the AT content. These results agree with previous studies of angiosperms, such as the genera Polygonatum and Epimedium42,43. Circular maps of the cp genomes of the three cultivars are provided in Fig. 2.
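The relationship between regional and total GC content can be sketched as follows. The Python example computes GC content from a sequence and then combines per-region values into a length-weighted total, counting the IR twice. The region lengths and GC percentages in the example are illustrative values taken from the lower ends of the ranges quoted above, not the figures for any single cultivar, and the function names are hypothetical.

```python
def gc_content(seq: str) -> float:
    """GC percentage of a sequence, ignoring anything that is not A, C, G, or T."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def weighted_total_gc(lsc_len: int, lsc_gc: float,
                      ssc_len: int, ssc_gc: float,
                      ir_len: int, ir_gc: float) -> float:
    """Length-weighted GC percentage of a quadripartite genome (IR counted twice)."""
    total_len = lsc_len + ssc_len + 2 * ir_len
    gc_bases = lsc_len * lsc_gc + ssc_len * ssc_gc + 2 * ir_len * ir_gc
    return gc_bases / total_len


print(gc_content("ATGC"))

# Illustrative values only: lower ends of the reported ranges.
print(round(weighted_total_gc(83_878, 36.32, 17_294, 32.61, 25_261, 43.61), 2))
```

Because the two IR copies together make up about a third of the genome and are the most GC-rich regions, they pull the genome-wide value above that of the single-copy regions.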

Table 1 Summary of the cp genome features for the three cultivated varieties.
Figure 2

Cp genome map of the three cultivated varieties. Genes lying outside the circle are transcribed in the clockwise direction, while those inside are transcribed in the counterclockwise direction. The colored bars indicate different functional groups. The darker red area in the inner circle denotes the GC content, while the orange area corresponds to the AT content of the genome. LSC: large single copy, SSC: small single copy, IRA/B: inverted repeat.

Table 2 List of genes present in the cp genome of the three cultivars of Scutellaria.

Additionally, the number and types of introns were similar among the cultivars of S. baicalensis, except that rpl16 in SBW had no intron. Eighteen genes each contained one intron: rpl2 (× 2), ndhB (× 2), trnE-UUC (× 2), and trnA-UGC (× 2) were located in the IR regions; trnK-UUU, rps16, trnS-CGA, atpF, rpoC1, trnL-UAA, petB, petD, and rpl16 were located in the LSC region; and ndhA was the only one present in the SSC region. In addition, ycf3 and clpP each contained two introns (Table S1). The trnK-UUU gene had the longest intron in the cp genomes of the three cultivated varieties, as is also found in Atractylodes44. The matK gene, which putatively codes for a plastid intron maturase, was located within the intron of the trnK-UUU gene45,46.

Codon usage

The cp genomes of the three cultivated varieties of S. baicalensis contained 64 codons encoding 20 amino acids. The RSCU results revealed that 31 codons were used frequently in these cultivars, with AGA having the highest frequency, followed by UUA (Fig. 3). Moreover, codon usage exhibited a strong bias toward A or T at the third position: codons ending in A or T mostly had RSCU ≥ 1, whereas codons ending in C or G mostly had RSCU ≤ 1. Amino acid frequency analyses revealed that leucine and isoleucine were the most frequent amino acids, whereas tryptophan was rare. In general, we found high similarity in codon usage and amino acid frequency among the three cultivated varieties, all of which have high AT content. Similar results were found in the cp genomes of other angiosperms47,48.

Figure 3

The RSCU values of all protein-coding genes for three cultivated varieties. Color key: the red values indicate higher RSCU values, and the blue values indicate lower RSCU values.

Repeat analysis

Our analyses identified SSRs in each genome, composed of mono- and dinucleotide repeat units (Fig. 4a). The number and types of SSRs in SBR and SBP were similar, with 25 mononucleotide repeats and 2 dinucleotide repeats. SBW contained only 21 mononucleotide repeats, fewer than SBR and SBP. Moreover, in all three cultivated varieties, the main type of mononucleotide repeat was T. Oligonucleotide repeat analysis with REPuter detected two types of repeats: forward (F) and palindromic (P). The number of repeats varied among the three cultivated varieties (Fig. 4b): SBW had 30 repeats (14 forward and 16 palindromic), SBR had 41 repeats (24 forward and 17 palindromic), and SBP had 33 repeats (16 forward and 17 palindromic). Most of the repeats ranged in size from 30 to 40 bp in all three cultivated varieties. This result showed that SBW and SBP were more similar to each other than either was to SBR. We also evaluated the number of repeats in relation to phylogenetic position using the topology in Fig. 9. The results confirmed that repeat numbers were distributed randomly, independent of phylogenetic position.
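As an illustration of how mono- and dinucleotide SSRs of the kind counted above can be located, the following Python sketch scans a sequence with regular expressions. It is not MISA; the minimum repeat numbers (10 for mononucleotide and 6 for dinucleotide units) are assumed thresholds for this example and are not stated in the study, and the function name is hypothetical.

```python
import re
from typing import Dict, List, Optional, Tuple


def find_ssrs(seq: str,
              min_repeats: Optional[Dict[int, int]] = None
              ) -> List[Tuple[int, str, int]]:
    """Return (1-based start, repeat unit, number of repeats) for each SSR."""
    if min_repeats is None:
        min_repeats = {1: 10, 2: 6}  # assumed thresholds: unit length -> min copies
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in sorted(min_repeats.items()):
        # One copy of the unit followed by at least (min_rep - 1) more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            if unit_len > 1 and len(set(unit)) == 1:
                continue  # e.g. "TT" runs are already reported as mononucleotide
            hits.append((match.start() + 1, unit, len(match.group(0)) // unit_len))
    return hits


# Toy sequence with a T run of 12 and an AT repeat of 6 copies.
toy = "GC" + "T" * 12 + "GACG" + "AT" * 6 + "C"
print(find_ssrs(toy))
```

Counting the returned tuples by unit length gives per-genome totals comparable in form to those reported for SBW, SBR, and SBP, although the exact numbers depend on the thresholds chosen.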

Figure 4

Comparison of repeats in the three cultivated varieties. (a) Distribution of SSRs in the cp genomes of the three cultivated varieties. (b) Classification of long repeats in the three cultivated varieties. F: forward repeats; P: palindromic repeats.

Comparative cp genomic analysis

The cp genomes of the three cultivated varieties were compared using mVISTA30, with S. baicalensis (NC027262) as the reference sequence for annotation. Figure 5 shows that the three cultivated varieties exhibit similar variation sites and degrees of variation. The coding regions (CDS) were more conserved than the intergenic spacers (IGS). High divergence in IGS was found in rps16-trnQ(UUG), trnQ(UUG)-psbK, psbL-trnS(GCU), trnR(UCU)-atpA, trnT(GGU)-psbD, trnG(GCC)-trnfM(CAU), psaA-ycf3, rps4-trnT(UGU), petA-psbJ, and trnF(UGG)-psaJ. Furthermore, some mutations in CDS were found in rps19, rpl16, and ycf2. These highly variable regions could be used to distinguish wild species from cultivated species. Moreover, the IR regions had lower sequence divergence than the LSC and SSC regions.

Figure 5

Comparison of the cp genomes of the three cultivated varieties using the S. baicalensis (NC027262) annotation as a reference. The vertical scale indicates the percentage of identity, ranging from 50 to 100%. The horizontal axis shows the coordinates within the cp genome. Genome regions are color-coded as exons, introns, and intergenic spacers (IGS), and the gray arrows indicate the direction of transcription of each gene. Annotated genes are displayed along the top.

To explore the sequence divergence among the three cultivated varieties, nucleotide diversity (Pi) was estimated to indicate the variability of potential plastid regions. The Pi values ranged from 0 to 0.01 (Fig. 6). The region at 4427–5018 bp showed high nucleotide diversity (Pi: 0.0067–0.0089) and was identified as the matK-rps16 IGS. Another highly variable region (Pi: 0.0067) appeared at 63,718–64,092 bp, located in petA-psbJ. Therefore, the complete cp genome could be used as a super-barcode to identify the three cultivated varieties.

Figure 6

Sliding window analysis of the entire cp genome of three cultivated varieties (window length: 300 bp; step size: 25 bp). X-axis: position of the window; Y-axis: nucleotide diversity of each window.

Inverted repeats contraction and expansion

Analysis of inverted repeat contraction and expansion revealed variation at the LSC/IR/SSC junctions. The junction types in the three cultivated varieties differed from those in S. baicalensis (NC027262) (Fig. 7). In all four genomes, a truncated copy of the rps19 gene was found at the IRb/LSC junction, the rpl22 gene was located entirely in the LSC region, and the rpl2 gene was located entirely in the IRb region. A truncated copy of the ndhF gene was found at the IRb/SSC junction in all genomes, starting in the IRb region and extending into the SSC region. Interestingly, the portion of ndhF within IRb was longer in S. baicalensis than in the three cultivated varieties. Moreover, a truncated copy of ycf1 was found at the SSC/IRa junction, and its portion within IRa was also longer in S. baicalensis. In the three cultivated varieties, trnN was located entirely in the IRa region, and trnH was located entirely in the LSC region, only 1 bp from the IRa/LSC junction. In comparison, the trnH gene of S. baicalensis was 178 bp from the IRa/LSC junction. These results show that the cp genomes of the three cultivated varieties display a unique IR contraction compared with the wild species.

Figure 7

Comparison of quadripartite junction sites in the cp genomes of the three cultivated varieties. Genes transcribed clockwise are presented below the track, whereas those transcribed counterclockwise are presented on top of the track. The start and end of each gene relative to the junctions are shown with arrows. The T scale bar above or below the track shows genes integrated from one region of the cp genome to another. JLB (IRb/LSC), JSA (SSC/IRa), JSB (IRb/SSC), and JLA (IRa/LSC) denote the junction sites between the quadripartite regions of the genome.

Specific DNA barcode marker design for three cultivated varieties

To discriminate the three cultivated varieties, we selected the hypervariable region at 4427–5018 bp, matK-rps16, to develop a barcode, with the primer sequences F (forward, 5′–3′): GAATTTCAATTTAACAATGCAATAATA and R (reverse, 5′–3′): ATATTTTTTTGAATTCTGAC. PCR amplification of total DNA from all samples of the three cultivated varieties yielded products of the expected size (Fig. S2). The DNA fragments were extracted from each band and then subjected to Sanger sequencing. The sequencing results were identical to the expected sequences (Fig. 8). The barcode contains one specific SNP locus and one indel locus. These two variable loci can be used to differentiate the three cultivated varieties.
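Basic properties of the two primers can be checked with a short Python sketch. It reports length and GC percentage and adds a rough melting temperature from the Wallace rule, Tm = 2(A + T) + 4(G + C), which is a coarse approximation intended for short oligonucleotides and is used here only as an assumed, illustrative estimate; the study does not state how Tm was evaluated. The function name is hypothetical.

```python
from typing import Dict, Union


def primer_stats(primer: str) -> Dict[str, Union[int, float]]:
    """Length, GC percentage, and Wallace-rule Tm estimate for a DNA primer."""
    primer = primer.upper()
    gc = primer.count("G") + primer.count("C")
    at = primer.count("A") + primer.count("T")
    return {
        "length": len(primer),
        "gc_percent": round(100.0 * gc / len(primer), 1),
        # Wallace rule: a rough estimate, assumed here for illustration only.
        "wallace_tm": 2 * at + 4 * gc,
    }


# Primer sequences as given in the text (5' to 3').
FORWARD = "GAATTTCAATTTAACAATGCAATAATA"
REVERSE = "ATATTTTTTTGAATTCTGAC"

for name, sequence in (("F", FORWARD), ("R", REVERSE)):
    print(name, primer_stats(sequence))
```

Both primers are AT-rich, which is expected for a plastid intergenic spacer and is one reason a relatively low annealing temperature such as the 55 °C used in the Methods can be appropriate.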

Figure 8

Sequencing chromatograms of the barcode regions from SBW1, SBW2, SBR1, SBR2, SBP1 and SBP2, with consensus sequence and alignment.

Phylogenetic analysis and divergence times analysis

Each subfamily in the Labiatae formed a monophyletic clade. Scutellarioideae, Lamioideae, Ajugoideae, Lavanduloideae, and Ocimoideae were sister groups to each other, consistent with previous genetic studies49. The genus Scutellaria belongs to the subfamily Scutellarioideae. The Flora of China classifies Scutellaria into Subgen. Scutellaria, Subgen. Anapis, and Subgen. Scutellariopsis; however, the results of this study do not support such a classification. The BI and ML phylogenetic trees (Fig. 9) and the phylogram (Fig. S3) revealed that, among the three cultivated varieties, SBP was more closely related to SBW, which is consistent with the result of the oligonucleotide repeat analysis. In addition, the three cultivated varieties grouped with S. baicalensis (NC027262) and formed a strongly supported sister relationship with S. rehderiana (NC060314); this branch then clustered with S. amoena (NC057255) and S. likiangensis (NC061416). This finding is consistent with previous studies50. Closely related plants may possess similar chemicals and have the same pharmacological properties; therefore, ethnobotanists have used a range of phylogenetic methods for bioprospecting51. According to previous research, the main pharmaceutically active ingredients of S. baicalensis are flavonoids, glycosides, and aglycones52,53. Modern pharmacological studies show that the active ingredients of S. baicalensis have anti-bacterial, anti-tumor, anti-oxidation, anti-viral, and anti-inflammation properties6,7,8. These results provide new ideas for the exploitation of S. baicalensis. The cp genomes seemed to provide more solid support for the reconstruction of phylogenetic relationships among these sections.

Figure 9

BI and ML phylogenetic tree based on 87 cp genes of the 21 species. The bootstrap support values are listed at each node.

The molecular clock trees were calibrated in MEGA and BEAST with fossil record data for A. thaliana-O. sativa (Figs. 10 and S4). Ocimum basilicum and Lavandula angustifolia were placed as the root species of Labiatae, with a divergence time estimated at 49.00 Mya (Fig. 10). The monophyletic group of the genus Scutellaria diverged at about 38.95 Mya. In a previous study based on genome sequence, the divergence time of S. baicalensis was approximately 13.28 Mya4, whereas based on the matK and CHS genes, the divergence time of S. baicalensis and S. salviifolia was approximately 1.37 Mya54. This study traced the divergence of S. baicalensis and the three cultivated varieties to 0.11 Mya and 0.10 Mya, later than previously reported. The differences in divergence time between the three cultivated varieties and S. baicalensis are likely due to the amount of data and to hybridization55,56.

Figure 10

Divergence time tree obtained from a molecular clock analysis using the MEGA software. The node ages are given for each node.

Conclusions

In this study, the cp genomes of the three cultivated varieties of S. baicalensis were sequenced and assembled, and a comparative analysis with other genomes was performed. S. baicalensis is one of the most commonly used medicinal materials in China, and the study of its cp genome can provide more biological information for its sustainable use. Overall, the cp genomes of the three cultivated varieties had similar structures and gene compositions. However, the sliding window results showed significant differences among the three cultivated varieties in matK-rps16 and petA-psbJ. Therefore, the complete cp genome could be used as a super-barcode to identify the three cultivated varieties. Moreover, we verified that the matK-rps16 sequence can be used as a barcode for the identification of the three varieties. The phylogenetic tree reconstructed from complete cp genomes indicated that S. baicalensis and S. rehderiana are closely related, which provides new ideas for the exploitation of S. baicalensis. In addition, the divergence time analysis showed that the three cultivated varieties diverged at about 0.10 Mya. Overall, these results provide species identification and biological information and contribute to bioprospecting and the improvement of ornamental value.

Sample collection and experiment statement

All methods, including the collection of plant leaves and the experiments, were carried out in accordance with relevant national, international, legislative, and institutional guidelines and regulations.