Introduction

According to APG IV, Liliales comprise about 1500 species in 10 families, whose circumscription has been treated differently by various authors1. The order remains under active botanical study, and family circumscriptions have been amended repeatedly2,3,4. Changes within Liliales include4,5,6: (1) Petermanniaceae, formerly placed within Colchicaceae, has been recognized as a family; (2) Luzuriagaceae has been placed in Alstroemeriaceae; and (3) Corsiaceae has recently been placed in Liliales. Liliaceae contains 15 genera and around 900 species7. The family’s classification has shifted significantly as a result of modern molecular phylogenetic analysis1,8,9. Lilium is a genus of Liliaceae containing circa 100 species10. The classification of this genus has historically been unclear. Several classifications of Lilium based on morphological characteristics have been suggested. Comber divided the genus into seven sections using 13 morphological characteristics11, and this scheme has primarily been applied to distinguish Lilium species and to explore Lilium phylogeny12. Nevertheless, species classification based on morphological traits is often unstable and unreliable, because such traits are frequently influenced by environmental conditions; this is why there is some disagreement about Comber’s classification. Comber’s classification has therefore been tested by molecular phylogenetic investigations13,14,15,16. Additionally, Kim et al.17, Du et al.18, and Kim et al.12 evaluated Lilium phylogeny using 9, 16, and 28 species, respectively, which partly resolved the phylogenetic relationships, but some ambiguities remain. For example, because of limited sampling of Lophophorum and Nomocharis, the position of L. distichum has not been adequately clarified. Kim et al.12 also pointed out that the position of L. candidum is uncertain and that more sampling is required to resolve it.

Furthermore, because today’s lilies were produced by collecting and hybridizing wild species, understanding wild lilies is essential for achieving breeding program goals19. L. ledebourii, called Susan-e Chelcheragh (SCh) in Persian, is a rare species in the genus Lilium. It has only been seen in Iran and Azerbaijan. Due to uncontrolled grazing and poaching, it is extremely endangered and is now protected only in a small area of Damash village20,21. L. ledebourii exhibits valuable traits, including attractive white flowers20, a high number of flowers, usually 2–1522 and even up to 24 (personal observations of the author, 2016), a sweet fragrance23 (quoted by24), an excellent vase life, vigorous growth, and good tolerance to low light intensity and low temperatures25. These traits make it no less beautiful than the commercial species. However, although the species is highly valuable and, more importantly, endangered, little effort has been made to use it in population genetic studies, and its position among lilies remains unclear.

DNA barcoding is one of the most efficient methods for characterizing and classifying various organisms at the species and genus levels26. One of the research hotspots for DNA barcode screening is the chloroplast (cp) genome, which can be employed as a super-barcode to solve classification problems in phylogenetic studies and species identification27,28. Chloroplasts, the energy generators of plant cells, ensure life on the earth29,30. This vital organelle serves as a signaling hub in the cell, releasing a diverse range of signals that coordinate a well-regulated and proper reaction to any condition31. The contents, structure, and gene organization of a chloroplast genome are more strictly conserved than those of a nuclear genome32. It has a much lower substitution rate than a nuclear genome, and the substitution rate is even lower in the two inverted repeat regions33. Notably, the information contained in chloroplast genomes, together with their almost nonrecombinant nature34 and maternal transmission35, has made the chloroplast genome a good source of clues about the origins of populations and a basis for phylogenetic reconstructions, thereby resolving ambiguities in evolutionary relationships36. Hence, current strategies for inferring plant molecular phylogeny rely heavily on cp genome sequences. Additionally, advances in sequencing technology, chiefly third-generation platforms such as PacBio, which relies on single-molecule real-time (SMRT) sequencing and produces reads > 10 kb, have accelerated the decoding of chloroplast genomes37,38.

Other advantages of chloroplast genomes include mutation hotspot sites and simple sequence repeats, which aid population genetics and species identification39. The selection pressure that a species faces during evolution is another fascinating aspect of chloroplast genome analysis, which has revealed the impact of various environmental pressures on cp genomes over long-term evolution40. Recent studies have discovered a number of positively selected genes, e.g., accD in Ipomoea41 and petG, rpl36, and atpB in Aquilegia42, as well as purifying selection at the cp genome scale in Stauntonia43, but very little is known about selection in Lilium at the gene level and nothing at the cp genome level.

This study reports the whole chloroplast genome of L. ledebourii, a precious endangered species, sequenced on the PacBio platform. In addition, using the genome data, we conducted a multi-scale genome-level analysis of this species, of species not previously analyzed at the genome scale, and of other Lilium species. In particular, for the first time, we present a comprehensive analysis of the selection pressure among Lilium species at both the gene level and the genome level. This study covers, as far as possible, topics overlooked in previous studies. Lastly, using richer taxon sampling, we built a new phylogenetic tree for Lilium based on the whole cp genome, and we believe that it gives more resolution to the Lilium classification than earlier studies. This article may serve as a cornerstone for future molecular studies and genetic improvement of L. ledebourii.

Results

The chloroplast genome features of L. ledebourii

The L. ledebourii cp genome showed the typical conserved quadripartite structure, with a length of 151,884 bp. The LSC (81,412 bp) and SSC (17,620 bp) regions were separated by two inverted repeats, IRa and IRb (26,426 bp each) (Fig. 1). The L. ledebourii cp genome has 37.0% GC content, with the IRs having the highest (42.5%) and the SSC having the lowest (37.5%). Overall, 113 different genes were recognized in the SCh chloroplast genome, consisting of 30 distinct tRNA genes, four distinct ribosomal RNA genes (4.5S, 5S, 16S, and 23S), and 79 unique protein-coding genes (Fig. 1, Table 1). Based on their functions, all genes were classified into five main categories (Table 1). Of the 113 genes, most occur as a single copy in the LSC or SSC regions, whereas 20 are duplicated in the IR regions. In addition, 18 genes in the L. ledebourii chloroplast genome contained introns, including 12 protein-coding genes (rps12, rps16, rpl2, rpl16, petB, petD, ndhA, ndhB, clpP, ycf3, rpoC1, and atpF) and 6 tRNA genes (trnG-UCC, trnL-UAA, trnK-UUU, trnV-UAC, trnA-UGC, and trnI-GAU), of which 16 genes had a single intron and two genes (clpP and ycf3) had two introns. infA was interpreted as a pseudogene. Trans-splicing was observed in the rps12 gene, with the 5′ exon positioned in the LSC region and the intron and 3′ exon positioned in the IR regions (Fig. 1, Table 1).
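The region lengths and gene counts reported above can be cross-checked with simple arithmetic. The following is a minimal sketch, not part of the original analysis; the function name is hypothetical, and only the numbers are taken from the text.

```python
# Region lengths (bp) as reported in the text for L. ledebourii.
LSC = 81_412
SSC = 17_620
IR = 26_426  # length of each inverted repeat copy (IRa and IRb)


def quadripartite_length(lsc: int, ssc: int, ir: int) -> int:
    """Total length of a circular cp genome with one LSC, one SSC and two IR copies."""
    return lsc + ssc + 2 * ir


total = quadripartite_length(LSC, SSC, IR)

# Gene categories as reported: tRNA, rRNA and protein-coding genes.
gene_count = 30 + 4 + 79

print(total, gene_count)
```

Both values agree with the totals stated in the text (151,884 bp and 113 genes), which supports reading the 26,426 bp figure as the length of each IR copy.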

Figure 1

The chloroplast genome map of L. ledebourii. Transcriptional directions are represented on the circle's inside (clockwise) and outside (counterclockwise). Genes are color-coded according to their functional groups.

Table 1 Gene content and functional classification of L. ledebourii chloroplast genome.

Genome comparison: boundary regions and divergence hotspots

The size of the Lilium cp genomes ranged from 151,655 bp in L. bakerianum to 153,235 bp in L. fargesii. Expansion and contraction at the IR/SC junctions, a typical phenomenon in plant evolution, were evaluated by comparing the border regions and adjacent genes of the Lilium chloroplast genomes. The chloroplast genomes of Lilium species showed slight junction variation at the IRa/LSC and IRb/SSC boundaries, although gene number and gene content were conserved (Fig. 2).

Figure 2

Comparison of the junction positions of LSC, SSC, and IR regions among the 48 Lilium cp genomes. The red identifiers represent the GenBank accession number of each species.

In all Lilium cp genomes, the IR regions were nearly the same size (26,382 bp to 26,990 bp). L. fargesii had the largest IR region expansion, which ended at the rps19 gene. In all Lilium cp genomes, the ycf1 gene was located at the IR/SSC boundary, leading to an incomplete gene duplication within the IRs, and the IR/LSC junction was located in the rps19 gene. In most Lilium cp genomes, the rps19 gene was found in the IRb region about 137 bp to 142 bp from the JLB boundary. This distance was markedly shorter in L. distichum (6 bp); L. gongshanense and L. nepalense (22 bp); L. henricii, L. meleagrinum, and L. rosthornii (23 bp); and L. amoenum and L. primulinum (31 bp). In most cp genomes (39/48), the ndhF gene was positioned inside the SSC region. However, in nine species (L. bakerianum, 25 bp; L. fargesii, 55 bp; L. gongshanense, 33 bp; L. lophophorum, 19 bp; L. pardalinum, 44 bp; L. philadelphicum, 22 bp; L. regale, 4 bp; L. superbum, 45 bp; and L. washingtonianum, 45 bp), the ndhF gene extended slightly into the IRb (Fig. 2).

In the Lilium species, the IR region was more conserved than the LSC and SSC regions. Synteny results revealed that the Lilium species have a high degree of sequence identity and collinearity at the cp genome-wide scale, especially in the IR region (Fig. S1). We also conducted a genome-wide sliding window analysis to detect hotspot regions in the Lilium cp genomes. Nucleotide diversity (Pi) averaged 0.00504 and ranged from 0 to 0.01913. Variability was higher in rpl32-trnL-ccsA, petD-rpoA, ycf1, psbI-trnS-trnG, rps15-ycf1, trnR, trnT-trnL, and trnP-psaJ-rpl33 among the 48 Lilium cp genomes. Divergence was more pronounced in the SC regions, which displayed higher nucleotide variability than the IR regions (Fig. 3).

Figure 3

Sliding window analysis of 48 Lilium cp genomes (window length: 600 bp; step size: 200 bp). The X-axis and Y-axis represent the position of a window and the nucleotide diversity (Pi) of each window, respectively.
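The sliding-window scan summarized in Fig. 3 can be sketched as follows. This is an illustrative implementation only: it assumes Pi is computed as the mean proportion of pairwise differences among aligned sequences, skipping gapped or ambiguous sites, and the function names are hypothetical. The software and exact estimator used in the study are not stated in this section, so results from dedicated tools may differ.

```python
from itertools import combinations


def pairwise_diff(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap or an ambiguous base are skipped.
    """
    valid = diffs = 0
    for x, y in zip(a, b):
        if x in "ACGT" and y in "ACGT":
            valid += 1
            diffs += x != y
    return diffs / valid if valid else 0.0


def nucleotide_diversity(seqs):
    """Mean pairwise difference per site (Pi) over all sequence pairs."""
    pairs = list(combinations(seqs, 2))
    return sum(pairwise_diff(a, b) for a, b in pairs) / len(pairs)


def sliding_pi(alignment, window=600, step=200):
    """Return (window midpoint, Pi) tuples along an alignment of equal-length sequences."""
    length = len(alignment[0])
    results = []
    for start in range(0, max(length - window, 0) + 1, step):
        chunk = [s[start:start + window] for s in alignment]
        results.append((start + window // 2, nucleotide_diversity(chunk)))
    return results


# Toy alignment (hypothetical data) with a single variable site at position 3.
toy = ["ACGTACGTAC", "ACGTACGTAC", "ACGAACGTAC"]
scan = sliding_pi(toy, window=4, step=2)
print(scan)
```

With the defaults (`window=600`, `step=200`) the function mirrors the window settings given in the caption; the toy call uses smaller values only so the result can be inspected by hand.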

SSRs and complex repeat analysis

In population genetic studies, the number and position of repeated DNA motifs (of 1–6 nucleotides) have been routinely employed to detect polymorphisms in cp genomes44. We identified SSRs in the cp genomes of SCh and 47 closely related species. In the SCh cp genome, 64 SSRs were found, most of which (60.94%) were mononucleotide repeats. In addition, only one pentanucleotide SSR pattern was observed in SCh (Fig. 4A). As shown in Fig. 4C, among the 48 Lilium cp genomes, the number of SSRs ranged from 53 (L. superbum) to 81 (L. pardanthinum). In total, 3234 microsatellites were detected in the 48 cp genomes of Lilium (Fig. 4C), with mononucleotide SSRs (57.48%) being the most common, whereas di-, tri-, tetra-, penta-, and hexanucleotide SSRs accounted for 17.56%, 7.58%, 14.78%, 2.35%, and 0.25% of all SSRs, respectively (Fig. 4B). The number of mononucleotide repeats in the 48 Lilium cp genomes varied from 25 (L. distichum) to 50 (L. fargesii). Hexanucleotide repeats were only observed in the cp genomes of L. henricii (AACTAG/AGTTCT), L. leichtlinii (AAATAT/ATATTT and ACTCAT/AGTATG), L. sp_KHK-2014 (ACGTAT/ACGTAT), L. tsingtauense (ACGTAT/ACGTAT), L. meleagrinum (AACTAG/AGTTCT), L. pardalinum (AATAGT/ACTATT), and L. sargentiae (AAATTC/AATTTG). Overall, various SSR motifs were found in Lilium cp genomes at different frequencies. This study characterized the occurrence and types of SSRs in Lilium species, which may be useful for molecular marker development and population genetics of Lilium, especially for L. ledebourii (Table S1, Fig. S2).
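To illustrate how SSRs of 1–6 nucleotides can be located in a sequence, the sketch below scans for perfect tandem repeats with regular expressions. The minimum repeat thresholds (10, 5, 4, 3, 3, 3) are an assumption chosen for illustration, since this section does not state the settings used, and the function names are hypothetical. It is a simplified stand-in for dedicated SSR tools, not the authors' pipeline.

```python
import re

# Assumed minimum repeat counts per motif length (not taken from the paper).
MIN_REPEATS = {1: 10, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}


def is_primitive(motif: str) -> bool:
    """True if the motif is not itself a repetition of a shorter unit (e.g. 'ATAT')."""
    n = len(motif)
    return all(motif != motif[:d] * (n // d) for d in range(1, n) if n % d == 0)


def find_ssrs(seq: str, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs in seq."""
    seq = seq.upper()
    hits = []
    for unit_len, n in min_repeats.items():
        # A unit of unit_len bases followed by at least n - 1 further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, n - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if is_primitive(motif):
                hits.append((m.start(), motif, len(m.group(0)) // unit_len))
    return sorted(hits)


# Toy sequence (hypothetical) containing one mono-, one di- and one trinucleotide SSR.
toy = "GG" + "A" * 12 + "CC" + "AT" * 6 + "CC" + "AAG" * 4 + "TT"
ssrs = find_ssrs(toy)
print(ssrs)
```

The `is_primitive` filter prevents a run such as A×12 from also being reported as a di- or trinucleotide repeat. Compound and imperfect SSRs are not handled in this sketch.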

Figure 4

The type and distribution of simple sequence repeats (SSRs) and complex repeats in the 48 Lilium cp genomes. (A) Frequency and type of SSRs in the L. ledebourii cp genome. (B) The number of SSR types discovered in 48 Lilium cp genomes. (C) The percentage of SSR types in 48 Lilium cp genomes. (D) The number and types of complex repeats in 48 Lilium cp genomes. (E) Frequency of complex repeats by size.

Complex repeat sequences play a role in the recombination and variation of chloroplast genomes45. The SCh chloroplast genome contains 32 complex repeats, including five tandem, ten dispersed, and 17 palindromic repeats (Table 2). These repeats were at least 24 bp in length, with the longest being 53 bp. Furthermore, the total number of complex repeats in the SCh genome was around 25% lower than the average number of repeats in Lilium genomes, with decreases of 18%, 32%, and 22% for tandem, dispersed, and palindromic repeats, respectively (Fig. 4D). In total, 29–55 long repeat sequences were discovered in each Lilium cp genome, including 9–23 dispersed repeats, 14–30 palindromic repeats, and 3–10 tandem repeats (Fig. 4D). With the exception of L. longiflorum, which contained a repeat of 162 bp, repeat sizes varied from 30 to 87 bp for dispersed, from 30 to 61 bp for palindromic, and from 15 to 85 bp for tandem repeats. In summary, across the 48 species, palindromic repeats were the most common type, and 60.86% of all repeats were between 30 and 40 bp in length (Fig. 4E).

Table 2 Dispersed and palindromic repeats by positions in the cp genome of L. ledebourii.

Codon usage bias analysis

Because synonymous codon bias is widespread among organisms, characterizing codon preference can shed light on evolution by clarifying selection pressure, and on translation efficiency through the use of major codons46,47. In total, 21,989 codons were detected in the SCh protein-coding genes. Codons ending in A or U were more prevalent than those ending in G or C. Among SCh amino acids, the highest and lowest frequencies were those of leucine (Leu = 2268) and cysteine (Cys = 255), respectively. In SCh, 30 codons were used more often than expected (RSCU > 1), and 31 codons were used less often than expected (RSCU < 1). In addition, there was no bias (RSCU = 1) in the frequency of the codons AUG (methionine), UGG (tryptophan), and AUA (isoleucine) (Table 3).
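For readers unfamiliar with RSCU, the value for a codon is its observed count divided by the count expected if all synonymous codons for that amino acid were used equally. The sketch below shows this calculation under stated assumptions: it uses the standard codon assignments (which the plastid code shares for sense codons) and hypothetical function names, and it is not the software used in the study.

```python
from collections import Counter
from itertools import product

# Standard codon table built in TCAG order ('*' marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds_list):
    """Relative synonymous codon usage for every codon of each amino acid observed."""
    counts = Counter()
    for cds in cds_list:
        cds = cds.upper().replace("U", "T")
        # Read complete in-frame codons only; skip codons with ambiguous bases.
        for i in range(0, len(cds) - len(cds) % 3, 3):
            codon = cds[i:i + 3]
            if codon in CODON_TABLE:
                counts[codon] += 1

    # Group codons into synonymous families by encoded amino acid.
    families = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)

    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid not observed
        for c in codons:
            # observed count / (family total / number of synonymous codons)
            values[c] = counts[c] * len(codons) / total
    return values


# Toy CDS (hypothetical): Met, Leu(TTA) x3, Leu(CTT), Trp.
toy_rscu = rscu(["ATGTTATTATTACTTTGG"])
print(toy_rscu["TTA"], toy_rscu["CTT"], toy_rscu["ATG"], toy_rscu["TGG"])
```

Amino acids encoded by a single codon, such as methionine (AUG) and tryptophan (UGG), always have RSCU = 1 under this definition, which is consistent with the values reported for those codons.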

Table 3 The Relative synonymous codon usage (RSCU) of L. ledebourii protein-coding genes.

Comparing the protein-coding genes of the 48 Lilium cp genomes, we found that each species contained 20,691–22,781 codons. Leucine (10.18–10.34%) was the most abundant encoded amino acid in all of the species studied, whereas cysteine (1.13–1.24%) was the least abundant (Table S2, Fig. 5). Among the 20 amino acids, the lowest and highest RSCU values were recorded for Tyr-UAC, encoding tyrosine, and Leu-CUU, encoding leucine, respectively. According to the RSCU values (RSCU > 1), codon usage in the Lilium cp genomes was biased towards A- and U-ending codons. In addition, the pattern of codon usage bias in the Lilium species was investigated. Figure S3 shows the values of the codon adaptation index (CAI), codon bias index (CBI), frequency of optimal codons (FOP), effective number of codons (NC), and GC3s for the 48 Lilium chloroplast genomes. These five parameters associated with codon usage bias were very similar across Lilium species (Fig. S3).

Figure 5

Codon distribution of protein-coding genes among Lilium cp genomes. Color code: Red denotes a higher RSCU and blue denotes a lower RSCU.

Selection pressure on Lilium cp genomes and adaptation evolution

In this study, the Ka/Ks ratio was computed for the 78 protein-coding genes shared by all 48 cp genomes (Table S3). A Ks of 0 can inflate Ka/Ks ratios or cause genes to be misidentified as being under strong positive selection, because the ratio becomes very high, infinite, or undefined. To avoid this problem, we excluded comparisons with Ks = 0 from all analyses. The average Ka/Ks value of the 78 protein-coding genes examined among the 48 cp genomes was 0.2472. Among the genes, ndhB, psbJ, psbZ, and ycf2 showed no synonymous substitutions (Ks = 0), which may reflect the divergence of the species or genetic constraint. rps7, atpH, petN, psaI, psaJ, psbF, psbL, psbT, and psaC were the most highly conserved genes, with an average Ka/Ks ratio of 0, indicating very strong purifying selection (Table S3). We found that accD, rpl16, and rpl36, with average Ka/Ks values of 1.268, 1.052, and 1.200, respectively, have been subjected to positive selection in the Lilium cp genomes. matK, petB, petD, rps4, rps12, and ycf1 had average Ka/Ks ratios in the range of 0.5 to 1, reflecting relaxed selection. The average Ka/Ks of the remaining genes was below 0.49, indicating that about 72% of genes (56/78) in the Lilium cp genomes were subject to purifying selection. Although an average Ka/Ks > 1 was recorded only for accD, rpl16, and rpl36, Ka/Ks > 1 was observed for 26 genes in at least one pairwise comparison (out of 1128 pairwise comparisons). ycf1 had 248 positively selected pairwise comparisons, followed by matK (144), ccsA (52), rbcL (37), ndhI (31), clpP (29), atpF (20), rpoC2 (20), cemA (16), ndhD (14), ndhF (10), petB (10), ndhA (9), ndhG (9), petD (9), ycf4 (8), rpoA (7), ndhJ (4), rpl33 (3), rps14 (3), ndhH (2), petG (2), rps4 (2), ndhC (1), rpoB (1), and rps2 (1).
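The bookkeeping behind these figures can be sketched briefly. The example below shows where the 1128 pairwise comparisons come from (all unordered pairs of 48 genomes) and how a single comparison might be binned using the thresholds described in the text. The function name, labels, and the Ka and Ks inputs are hypothetical; estimating Ka and Ks themselves requires a codon substitution model and is outside this sketch.

```python
from math import comb

N_GENOMES = 48
# Number of unordered genome pairs: 48 * 47 / 2.
n_pairs = comb(N_GENOMES, 2)


def classify_selection(ka: float, ks: float) -> str:
    """Bin one Ka/Ks comparison using the thresholds described in the text.

    Comparisons with Ks = 0 are flagged so they can be excluded, as the ratio
    would otherwise be infinite or undefined.
    """
    if ks == 0:
        return "excluded (Ks = 0)"
    ratio = ka / ks
    if ratio > 1:
        return "positive"
    if ratio >= 0.5:
        # Includes the neutral case Ka/Ks = 1.
        return "relaxed"
    return "purifying"


# Hypothetical Ka and Ks values chosen only to exercise each branch.
examples = {
    "ratio 1.268": classify_selection(0.01268, 0.01),
    "ratio 0.6": classify_selection(0.006, 0.01),
    "ratio 0": classify_selection(0.0, 0.02),
    "Ks zero": classify_selection(0.01, 0.0),
}
print(n_pairs, examples)
```

Counting how many of the 1128 comparisons fall in the "positive" bin for each gene gives per-gene tallies of the kind listed above.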

A second analysis compared Lilium species using the concatenation of all protein-coding genes, yielding 1128 pairwise Ka/Ks values. A pairwise Ks of zero was seen between L. matangense and L. xanthellum. Ka/Ks > 1 was obtained only between L. speciosum and L. distichum (Ka/Ks = 1.375). Ka/Ks values of about 1 were found between L. amabile and L. pumilum, L. henricii and L. meleagrinum, L. sulphureum and L. henryi, and L. lancifolium and L. pumilum (Ka/Ks = 1.062). Among all comparisons, Ka/Ks = 0 was recorded only between L. sp._KHK-2014 and L. tsingtauense. Overall, the average pairwise Ka/Ks value was 0.4839 (Fig. 6), indicating that at the whole cp protein level, Lilium species were subject to purifying selection.

Figure 6

Ka/Ks ratios between Lilium cp genome pairs. The heatmap depicts pairwise Ka/Ks ratios between the concatenated single-copy CDS sequences in the multigene nucleotide alignment.

Phylogenetic analysis

Both analyses, one using complete chloroplast genomes (CCGs) and the other using protein-coding sequences (CDSs), divided the 47 Lilium species into two main groups. Although the members of both main groups were the same in both topologies, there were slight differences between them (Figs. 7, S4). Because of the high synteny among Lilium cp genomes, this study concentrated on phylogenetic analysis of whole cp genome sequences to examine relationships across the 47 Lilium species. Maximum likelihood was employed with two species serving as outgroups. According to the topology, the majority of nodes were highly supported: 34 of the 45 nodes received very high bootstrap support (≥ 99%). According to the CCG topology, the 47 Lilium species were divided into two main groups consisting of 11 clades (Fig. 7).

Figure 7

The phylogenetic relationships of Lilium species based on whole cp genome sequences. Fritillaria hupehensis and Fritillaria cirrhosa were used as outgroups. The phylogenetic tree was constructed by maximum likelihood (ML). The ML bootstrap values are represented by the numbers above the branches.

Group 1 included 17 species, which were placed in four clades and four independent lineages. Clade I consisted of six Sinomartagon species: L. callosum, L. amabile, L. cernuum, L. lancifolium, L. pensylvanicum, and L. primulinum. Clade II was composed of two Martagon species (L. tsingtauense and L. hansonii). Clade III consisted of two Leucolirion species (L. formosanum and L. longiflorum) and L. brownii, which belongs to Archelirion. L. candidum of Liriotypus and L. ledebourii formed the monophyletic clade IV. L. davidii was sister to clade II. L. martagon, L. leichtlinii, and L. bulbiferum were more distantly related to the clade I Sinomartagon species (Fig. 7).

Group 2 included 30 species, which were placed in seven clades and three independent lineages. Clade V was composed of four Sinomartagon species: L. bakerianum, L. taliense, L. primulinum, and L. nepalense. Clade VI consisted of two Sinomartagon species (L. henricii and L. souliei) and three species (L. pardanthinum, L. gongshanense, and L. meleagrinum) that have recently been transferred from the genus Nomocharis to the genus Lilium. L. speciosum, L. japonicum, and L. distichum formed clade VII. Five species of Leucolirion (L. regale, L. leucanthum, L. sulphureum, L. sargentiae, and L. henryi) formed clade VIII. Clade IX accommodated L. duchartrei and L. lankongense. Three species of Lophophorum (L. fargesii, L. lophophorum, and L. matangense), along with L. nanum and L. xanthellum, were included in clade X. Three species belonging to Pseudolirium (L. superbum, L. washingtonianum, and L. pardalinum) formed clade XI. L. amoenum, L. rosthornii, and L. philadelphicum were sisters to clade V, clade VIII, and clade X, respectively (Fig. 7).

Discussion

The gene content, gene organization, and GC content of the SCh cp genome were conserved and similar to those of other species17,18. Furthermore, trans-splicing was observed in the rps12 gene, which is also seen in other species48. The length of the Lilium cp genomes ranged from 151,655 bp in L. bakerianum to 153,235 bp in L. fargesii. It has been suggested that one of the primary causes of changes in cp genome size is the contraction, expansion, or loss of the IR region49.

We surveyed the expansion and contraction variability at the IR/SC junctions. The LSC/IRb boundary is stable, while slight junction variation can be seen at the IRa/LSC and IRb/SSC boundaries. Contraction and expansion of IR regions during evolution is relatively common and has been used as an evolutionary marker in phylogenetic studies50,51. Expansion of the IR/LSC junction into rps19 has been observed in other Liliaceae such as Amana, and this event appears to be an ancestral trait (symplesiomorphy) of Liliaceae52. The contraction and expansion of IR regions in Liliaceae has placed ycf1 and rps19 at the boundaries between SC and IR regions, with varying lengths, as demonstrated in Fritillaria53 and Cardiocrinum54. Given the agreement of our findings with those for other plants52,53,54, expansion and contraction of IR regions may be a significant mechanism behind the different lengths of the 48 Lilium cp genomes.

Analysis of nucleotide diversity and cp repeats can be used to identify molecular markers, reconstruct evolutionary relationships, and explore population genetics55,56. In this study, a total of 3234 microsatellites were discovered in the 48 cp genomes of Lilium. A/T repeats were the most common type of SSR found. The abundance of this SSR type is consistent with the majority of other cp genomes explored thus far57. Additionally, complex repeats were found in the 48 Lilium genomes, which could be substantial hotspots of genomic reconfiguration58,59. The number and size of tandem, dispersed, and palindromic repeats were nearly identical to those in the cp genomes of related taxa such as Fritillaria60. In particular, the occurrence of large repeats in the chloroplast genome, such as the 162 bp tandem repeat of L. longiflorum found here, has probably been linked to an unstable genomic structure caused by improper rearrangements61.

Here, higher divergence was found in the SC regions and lower divergence in the IRs, suggesting that the IRs are more conserved than the other regions, as in most angiosperms62. This pattern is caused by copy correction between the IRs and the removal of harmful mutations via gene conversion63. rpl32-trnL-ccsA64, trnP-psaJ-rpl33, petD-rpoA, ycf118, psbI-trnS-trnG65, rps15-ycf1, and trnT-trnL66 have previously been identified as highly variable regions in various species. These regions are expected to be very helpful in clarifying the phylogeny of the genus Lilium in the future.

The ratio of Ka (non-synonymous) to Ks (synonymous) nucleotide substitution rates is commonly employed as a powerful tool for clarifying the evolution of protein-coding genes and the adaptive evolution of species67,68. The Ka/Ks ratio indicates the degree of gene divergence and whether selection pressure is positive (Ka/Ks > 1), purifying (Ka/Ks < 1, particularly if it is less than 0.5), or neutral (Ka/Ks = 1)67.

In the present study, Ka/Ks > 1 was recorded for accD, rpl16, and rpl36, implying that these genes could be important in adaptive evolution51. Positive selection on accD may be related to the gene's important role in tolerance and resistance to stress, insect predation, and pathogens69. Positive selection on accD has been observed in Ipomoea41 and Stauntonia43. The other positively selected genes, rpl16 and rpl36, encode ribosomal proteins, which have been shown to be necessary for the development of chloroplast ribosomes in plants70. Previous studies have reported positive selection for rpl16 in Lonicera71 and rpl36 in Aquilegia42.

However, some genes showed Ka/Ks > 1 in at least one pairwise comparison, suggesting that they were potentially subject to positive selection pressure among Lilium species. ycf1 had 248 positively selected pairwise comparisons, followed by matK (144), ccsA (52), rbcL (37), ndhI (31), clpP (29), atpF (20), rpoC2 (20), cemA (16), ndhD (14), ndhF (10), petB (10), ndhA (9), ndhG (9), petD (9), ycf4 (8), rpoA (7), ndhJ (4), rpl33 (3), rps14 (3), ndhH (2), petG (2), rps4 (2), ndhC (1), rpoB (1), and rps2 (1). Of these, matK has previously been found under positive selection in over 30 different taxonomic groups72. The NADH-dehydrogenase (ndh) genes are fundamental to the use of light energy and to the electron transfer chain that produces ATP, both significant components of photosynthesis73,74. Consequently, these genes, as important components involved in plant growth, may have accumulated more substitutions in order to adapt to various environmental conditions43. ycf1 is a large open reading frame in higher plants that encodes a protein product of many amino acids. Moreover, knockout studies have shown that ycf1 is necessary for cell survival75. clpP, which encodes the ATP-dependent Clp protease, is thought to play a role in chloroplast protein transformation and may be necessary for shoot development through clpP-mediated protein degradation76,77. Another positively selected gene in our study is the rubisco large-chain gene (rbcL). Positive selection on rbcL has been reported in many higher plants78. atpF encodes one of the H+-ATP synthase subunits, which is necessary for some photosynthetic processes79. Positive selection has also been noted in the rpo genes, which encode proteins involved in the modification of transcription and in post-transcriptional modification80. Because cemA cooperates with nuclear genes81, it might evolve relatively quickly82. Amino acid substitutions, indels, and premature stop codons could lead to positive selection on ccsA83.

The positively selected genes may have had crucial roles in the adaptation of Lilium to different environments. Gao et al.84 documented the adaptation of chloroplast genes to ecological environments with different sunlight preferences. Moreover, additional undiscovered selective pressures may be involved in increasing the Ka/Ks ratio, leading to species divergence85. The Ka/Ks ratios of the majority of genes shared by the Lilium chloroplast genomes, and of the pairwise species comparisons using all protein-coding genes, were less than 1, suggesting purifying selection. Similar findings were reported for Gentiana species86. These low Ka/Ks ratios may arise because most nonsynonymous substitutions are probably disadvantageous and are removed by purifying selection, so that the selective constraint on nonsynonymous substitutions is stronger than that on synonymous substitutions87,88. In short, positive selection on some genes likely enriches the diversity and adaptability of Lilium.

At the whole cp protein scale, Lilium species were subject to purifying selection. UV radiation in sunlight damages and rearranges DNA89,90, and higher temperatures speed up metabolism91, both of which contribute to increased mutation rates. Purifying selection, one of the most common types of natural selection, constantly helps to eliminate disadvantageous mutations from populations. Purifying selection would thus be an evolutionary outcome of the preservation of the adaptive traits of Lilium species.

The richer the taxon sampling, the more accurately Comber’s classification can be assessed. The present study clustered 47 species of Lilium, some of which had not previously been evaluated at the whole cp genome level. The Lilium species were distributed into two main groups comprising 11 clades. The position of the species in the present topology is consistent with the classification by Kim et al.12.

Our samples did not support the monophyly of Comber’s sections11. According to our results, L. martagon was placed farther away from the Martagon species and was closer to L. leichtlinii, L. bulbiferum, and L. davidii, which agrees with Gong et al.16 based on nrITS. We observed L. brownii of Comber’s Archelirion far from the other two species of Archelirion (L. japonicum and L. speciosum) and close to the two species of Leucolirion (L. formosanum and L. longiflorum). Similarly, Li et al.9 classified L. brownii alongside L. formosanum and L. longiflorum in a genus-level study of the evolution of the Liliaceae family. As our results suggest, and in line with ITS-based classification92,93, now corroborated by cp genome-based classification, L. brownii can be moved from Archelirion to Leucolirion.

Based on morphology, L. duchartrei, L. lankongense, L. nanum, and L. rosthornii belong to the Sinomartagon section11. However, based on our cp genome-scale topology, these species are farther from the Sinomartagon species and closer to the Lophophorum species. Studies have shown that L. nanum and Lophophorum have similar karyotypes94. Additionally, based on the ITS regions, Du et al.92 reported that L. duchartrei and L. lankongense are in the same clade and are closer to Lophophorum. We placed L. xanthellum in clade X, away from Leucolirion. According to ultrametric chronograms, L. xanthellum is closer to L. lophophorum and L. matangense95. Overall, the composition of clade X in our phylogeny (three species of Lophophorum, namely L. fargesii, L. lophophorum, and L. matangense, along with L. nanum of Sinomartagon and L. xanthellum of Leucolirion) is in agreement with Du et al.92 based on ITS regions.

Although the monophyly of the three Martagon species (L. hansonii, L. tsingtauense, and L. distichum) was rejected12,92 prior to our study, the position of L. distichum has so far not been clear enough owing to the restricted sampling of Lophophorum and Nomocharis. Based on our tree topology and with the help of a richer sampling, L. distichum is further away from the species it grouped with in previous studies. Our phylogenetic tree shows L. distichum close to two species of Archelirion (L. japonicum and L. speciosum). Moreover, in terms of flower morphology, the members of Comber’s Martagon differ from each other. The flowers of L. distichum are outfacing, whereas those of L. tsingtauense and L. martagon are upright and nodding, respectively19.

As shown in the results (Fig. 7), L. henricii and L. souliei, two species of Comber’s Sinomartagon, are placed next to Nomocharis-Lilium (L. gongshanense, L. meleagrinum, and L. pardanthinum). Gao et al.96, based on biogeographic results, showed that L. henricii is associated with Nomocharis species. Moreover, Gao et al.97, by examining 38 Lilium species and 7 Nomocharis species using the ITS dataset, placed L. souliei inside the Nomocharis clade.

Based on our CCG topology, L. amoenum of Lophophorum was sister to L. bakerianum, a member of clade V. Zhou et al.98, based on fluorescence in situ hybridization, showed that the signal pattern of 35S rDNA in L. amoenum was the same as in L. bakerianum. Moreover, by mapping the chromosome pattern for 35S rDNA based on ITS data, these researchers showed that the two species are monophyletic.

The phylogenetic position of L. ledebourii has been very ambiguous due to the scarcity of molecular information. To date, two studies have addressed it. Kim and Kim8 included L. ledebourii (with 15 other species) in building a phylogenetic tree. However, that study does not provide a clear picture of the position of this species, owing to restricted sampling and the use of only four chloroplast genes (rbcL, matK, ndhF, and atpB), which, according to the genome-scale results, are not among the most divergent hotspots. Ghanbari et al.99, based on the ITS marker, examined the position of this species and showed that L. ledebourii (Damash sample) is far from L. candidum.

Resolving the position of L. ledebourii was one of the objectives of this study, and Kim et al.12 reported that the position of L. candidum remained uncertain. Interestingly, according to our cp genome-scale comparisons, L. ledebourii and L. candidum formed a monophyletic group. L. ledebourii is a rare species that has only been seen in Iran and Azerbaijan20. Furthermore, L. candidum is thought to have originated in Persia and Syria100. Overall, our findings indicate that whole cp genome phylogenomic comparison can resolve much of the controversy and pave the way for a clearer Lilium phylogeny, especially for L. ledebourii.

Conclusions

The whole chloroplast genome of L. ledebourii is reported here for the first time. Using whole chloroplast genomes of Lilium, the current study revealed structural characteristics and sequence diversity and clarified relationships between species. Certain variation hotspots, identified as highly variable regions, could serve as specific DNA barcodes. We provide a comprehensive analysis of selection pressure; at the scale of all cp protein-coding genes, Lilium species were subject to purifying selection. This study addressed the restricted sampling of Lophophorum and Nomocharis as far as possible. For the first time, L. ledebourii was included in the classification of the genus Lilium, and its position was determined. The position of some species, e.g., L. distichum, became clearer than before. We suggest that L. brownii be transferred from Archelirion to Leucolirion. We believe that the Lilium species have been classified with higher resolution than in earlier studies, which will be helpful in understanding the evolution of Lilium species. The genetic resources provided here will aid future studies in species identification, population genetics, and Lilium conservation.

Materials and methods

Sample collection and DNA extraction

Fresh leaves of L. ledebourii were sampled from Damash village and frozen in liquid nitrogen. The leaf samples were gathered in compliance with national and international legislation and guidelines. The specimen was certified by the herbarium of the Faculty of Agricultural Science and Engineering, University of Tehran, and a validated voucher specimen was deposited at the Department of Horticultural Science, with voucher specimen number 6594. Total genomic DNA was isolated from leaves using a DNeasy plant DNA extraction kit (Qiagen, USA) following the manufacturer's guidelines. DNA integrity was evaluated on a 1% agarose gel, and DNA was quantified with a NanoDrop spectrophotometer. Extracted DNA was stored at −80 °C.

Chloroplast DNA sequencing and genome assembly

A SMRT library with a 15–20 Kb insert size was sequenced on the PacBio RS II platform at the Duke Center for Genomics and Computational Biology, USA. To extract potential chloroplast sequences, the PacBio data were mapped to the reference L. hansonii cp genome (KM103364) using BLASR101. Error correction was performed on the SMRT reads with the sprai pipeline102. The corrected reads were assembled using a Perl-based pipeline103. Furthermore, overlapping ends were checked with the “check_circularity.pl” script.
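The idea behind checking overlapping ends can be illustrated with a minimal Python sketch. It does not reproduce the “check_circularity.pl” script; the function names, the exact-match comparison, and the min_overlap default are assumptions made for illustration only. A linear contig assembled from a circular molecule usually repeats its first bases at its end, so the sketch looks for the longest suffix that equals the prefix and trims it.

```python
def terminal_overlap(contig, min_overlap=100):
    """Return the length of the longest suffix that exactly equals the prefix.

    Only overlaps of at least `min_overlap` bases and at most half the contig
    are considered; 0 is returned if none is found. Exact matching is a
    simplifying assumption (real pipelines usually tolerate mismatches).
    """
    contig = contig.upper()
    for size in range(len(contig) // 2, min_overlap - 1, -1):
        if contig[:size] == contig[-size:]:
            return size
    return 0


def circularize(contig, min_overlap=100):
    """Trim the redundant terminal overlap so the sequence represents one circle."""
    size = terminal_overlap(contig, min_overlap)
    return contig[:-size] if size else contig


if __name__ == "__main__":
    genome = "ACGTTGCAAGGCTA"          # toy circular genome
    contig = genome + genome[:5]       # linear contig with a 5-bp redundant end
    print(terminal_overlap(contig, min_overlap=3))
    print(circularize(contig, min_overlap=3) == genome)
```

For error-containing long-read assemblies, an alignment-based comparison of the contig ends would be more appropriate than the exact string match used here.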

Chloroplast genome annotation

GeSeq104, with NCBI RefSeq for Lilium as the reference dataset, was used to annotate protein-coding, ribosomal RNA (rRNA), and transfer RNA (tRNA) genes. The sequence coordinates of all annotated genes were checked using BLAST against cp genes and manually edited. tRNAscan-SE version 2.0 was used to double-check the tRNA genes105. A circular physical map of the chloroplast genome was drawn using the OrganellarGenomeDRAW (OGDRAW) toolkit106 and Chloe (https://github.com/ian-small/chloe).

Repeat sequence analysis

The MicroSAtellite identification tool (MISA) was used to screen simple sequence repeats (SSRs) in the 48 cp genomes, with minimum repeat thresholds of 10 for mono-, 5 for di-, 4 for tri-, and 3 for tetra-, penta-, and hexanucleotide repeats107. Tandem repeats were identified with Tandem Repeats Finder (v4.09)108 using default parameters. Moreover, Vmatch (V2.3.1)109 was used to identify palindromic (≥ 20 bp) and dispersed (≥ 30 bp) repeats.
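The repeat-count thresholds stated above can be illustrated with a minimal Python sketch. It is not MISA and does not reproduce its output format or its treatment of compound SSRs; the function names and the simple scan for perfect repeats are assumptions made for illustration only.

```python
# Minimum number of repeats per motif length (thresholds as stated in the text).
MIN_REPEATS = {1: 10, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}


def is_primitive(motif):
    """True if the motif is not itself a repetition of a shorter unit (e.g. 'ATAT')."""
    n = len(motif)
    return not any(n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n))


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return a sorted list of (start, motif, repeat_count) for perfect SSRs.

    `start` is 0-based. Motifs containing characters other than A, C, G, T
    are skipped.
    """
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in min_repeats.items():
        i = 0
        while i + unit_len <= len(seq):
            motif = seq[i:i + unit_len]
            if set(motif) - set("ACGT") or not is_primitive(motif):
                i += 1
                continue
            # Extend the run for as long as the motif keeps repeating.
            j = i + unit_len
            while seq[j:j + unit_len] == motif:
                j += unit_len
            repeats = (j - i) // unit_len
            if repeats >= min_rep:
                hits.append((i, motif, repeats))
                i = j          # continue after the reported SSR
            else:
                i += 1
    return sorted(hits)


if __name__ == "__main__":
    toy = "GC" + "A" * 10 + "GC" + "AT" * 5 + "GG"
    for start, motif, repeats in find_ssrs(toy):
        print(start, motif, repeats)
```

Skipping non-primitive motifs prevents a mononucleotide run such as A×10 from also being reported as a dinucleotide repeat (AA×5).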

Chloroplast genome comparison

To identify divergent regions in Lilium, the distances between adjacent genes and the junctions of the SSC, LSC, and IR regions of L. ledebourii were compared with those of the other Lilium cp genomes. For each codon, the ratio of usage frequency was obtained as the Relative Synonymous Codon Usage (RSCU) value using DAMBE V6 for 48 Lilium species110. The codon preference results were visualized in R. To evaluate the degree of codon bias, CodonW V1.4.4 (http://codonw.sourceforge.net) was used to calculate the codon bias index (CBI), the codon adaptation index (CAI), the frequency of optimal codons (Fop), the effective number of codons (NC), and the GC content of synonymous third codon positions (GC3s)111. To discover mutation hotspot sites, the nucleotide diversity in the Lilium chloroplast genomes was quantified by sliding window analysis in DnaSP 6 software112. The window length and step size were fixed at 600 bp and 200 bp, respectively.
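The sliding window calculation can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reimplementation of DnaSP: nucleotide diversity is taken as the mean proportion of pairwise differences per site, columns containing gaps or ambiguous bases are excluded (complete deletion), and the function names are hypothetical. The defaults mirror the 600 bp window and 200 bp step given above.

```python
from collections import Counter


def column_pi(column):
    """Per-site nucleotide diversity for one alignment column.

    Returns None for columns with gaps or ambiguous bases (assumed to be
    excluded, i.e. complete deletion).
    """
    if any(base not in "ACGT" for base in column):
        return None
    n = len(column)
    counts = Counter(column)
    # Number of sequence pairs that differ at this site.
    differing_pairs = (n * n - sum(c * c for c in counts.values())) / 2
    return differing_pairs / (n * (n - 1) / 2)


def sliding_window_pi(alignment, window=600, step=200):
    """Return a list of (start, end, pi) tuples over an aligned set of sequences.

    Coordinates are 0-based, end-exclusive alignment positions; at least two
    equal-length sequences are required.
    """
    rows = [s.upper() for s in alignment]
    length = len(rows[0])
    if len(rows) < 2 or any(len(s) != length for s in rows):
        raise ValueError("need at least two aligned sequences of equal length")
    per_site = [column_pi([s[i] for s in rows]) for i in range(length)]
    results = []
    for start in range(0, max(length - window, 0) + 1, step):
        sites = [p for p in per_site[start:start + window] if p is not None]
        pi = sum(sites) / len(sites) if sites else float("nan")
        results.append((start, min(start + window, length), pi))
    return results


if __name__ == "__main__":
    toy = ["AAAA", "AAAT", "AATT"]
    for start, end, pi in sliding_window_pi(toy, window=2, step=2):
        print(start, end, round(pi, 4))
```

Windows with the highest values would correspond to candidate mutation hotspots; how gap-containing sites are treated can change the values, so this choice should be matched to the software actually used.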

Selection pressure on Lilium cp genomes

We extracted the CDS sequences of the protein-coding genes from all 48 species and excluded those with missing data in at least one species, which resulted in 78 CDS matrices. The MAFFT v7 program113 was used to generate CDS alignments. DnaSP 6 software112 was used to compute the rates of nonsynonymous (Ka) and synonymous (Ks) nucleotide substitution. The selective pressure was measured using the Ka/Ks ratio, with Ka/Ks < 1, Ka/Ks = 1, and Ka/Ks > 1 indicating purifying, neutral, and positive selection, respectively67.
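The gene filtering step and the interpretation of the Ka/Ks ratio can be sketched as follows. This is a minimal Python illustration, not the procedure implemented in DnaSP; the function names, the input layout (one dictionary of gene name to CDS per species), and the tolerance used to treat a ratio as equal to 1 are assumptions.

```python
def shared_genes(cds_by_species):
    """Return the sorted gene names that have a CDS in every species.

    `cds_by_species` maps species name -> {gene name: CDS sequence}
    (a hypothetical layout used only for this sketch).
    """
    gene_sets = [set(genes) for genes in cds_by_species.values()]
    return sorted(set.intersection(*gene_sets)) if gene_sets else []


def classify_selection(ka, ks, tol=1e-9):
    """Classify selective pressure from Ka and Ks.

    Ka/Ks < 1 -> 'purifying', = 1 -> 'neutral', > 1 -> 'positive'.
    The ratio is undefined when Ks is zero.
    """
    if ks == 0:
        return "undefined"
    ratio = ka / ks
    if abs(ratio - 1.0) <= tol:
        return "neutral"
    return "purifying" if ratio < 1.0 else "positive"


if __name__ == "__main__":
    toy = {
        "species_1": {"rbcL": "ATGTCA", "matK": "ATGGAG"},
        "species_2": {"rbcL": "ATGTCT"},
    }
    print(shared_genes(toy))
    print(classify_selection(0.01, 0.05))
```

Genes with Ks equal to zero are reported as undefined rather than being forced into one of the three classes.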

Phylogenetic analysis

The MAFFT v7 program was used to align the complete cp genome sequences of 47 Lilium species113. The complete cp genome sequences of Fritillaria hupehensis and Fritillaria cirrhosa were used as outgroups. Maximum likelihood analysis was carried out with the RAxML program (v8.2.11)114 under the GTR + G nucleotide substitution model with 1000 bootstrap replicates to construct the phylogenetic tree. Finally, the iTOL tool was used to visualize the resulting phylogenetic tree115. Furthermore, phylogenetic analyses were also carried out for the single-copy protein-coding genes (CDSs) shared by all 47 species.
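A RAxML v8 run of this kind could be set up roughly as in the Python sketch below, which only assembles the command line and does not execute it. The file name, run name, outgroup labels, random seed, and binary name are hypothetical, and the use of the combined rapid bootstrap and ML search mode (-f a) is an assumption, since the text specifies only the GTR + G model and 1000 bootstrap replicates.

```python
def raxml_command(alignment, run_name, outgroups,
                  bootstraps=1000, seed=12345, binary="raxmlHPC"):
    """Build an argument list for a RAxML v8 ML analysis with bootstrapping.

    -f a        rapid bootstrap plus best-scoring ML tree search (assumed mode)
    -m GTRGAMMA GTR model with gamma-distributed rate heterogeneity (GTR + G)
    -o          comma-separated outgroup names, as labelled in the alignment
    -p / -x     random seeds for parsimony starts and rapid bootstrapping
    -#          number of bootstrap replicates
    """
    return [
        binary,
        "-f", "a",
        "-m", "GTRGAMMA",
        "-s", alignment,
        "-n", run_name,
        "-o", ",".join(outgroups),
        "-p", str(seed),
        "-x", str(seed),
        "-#", str(bootstraps),
    ]


if __name__ == "__main__":
    cmd = raxml_command(
        "lilium_cp_alignment.phy",           # hypothetical MAFFT output converted to PHYLIP
        "lilium_cp",                         # hypothetical run name
        ["Fritillaria_hupehensis", "Fritillaria_cirrhosa"],
    )
    print(" ".join(cmd))
    # To run it: import subprocess; subprocess.run(cmd, check=True)
```

The outgroup labels passed to -o must match the sequence names in the alignment file exactly.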