Complete chloroplast genome sequences of Lilium: insights into evolutionary dynamics and phylogenetic analyses

Lilium is a large genus that includes approximately 110 species distributed throughout cold and temperate regions of the Northern Hemisphere. The species-level phylogeny of Lilium remains unclear; previous studies have found universal markers but insufficient phylogenetic signals. In this study, we present the use of complete chloroplast genomes to explore the phylogeny of this genus. We sequenced nine Lilium chloroplast genomes and retrieved seven published chloroplast genomes for comparative and phylogenetic analyses. The genomes ranged from 151,655 bp to 153,235 bp in length and had a typical quadripartite structure with a conserved genome arrangement and moderate divergence. A comparison of sixteen Lilium chloroplast genomes revealed ten mutation hotspots. Single nucleotide polymorphisms (SNPs) for any two Lilium chloroplast genomes ranged from 8 to 1,178 and provided robust data for phylogeny. Except for some of the shortest internodes, phylogenetic relationships of the Lilium species inferred from the chloroplast genome obtained high support, indicating that chloroplast genome data will be useful to help resolve the deeper branches of phylogeny.

genome regions have frequently been used in Lilium molecular systematics, including matK, rbcL, ndhF, and spacer regions of trnL-F, rpl32-trnL, trnH-psbA, or their combination. In addition, these studies have shown an incongruence between plastid and nuclear phylogenies 4,12,16 . A similar incongruence has been reported in recent studies of the genus Oryza 17 , the tribe Arundinarieae 18 , the genera Medicago 19 and Ilex 20 and has been attributed to the use of markers with insufficient phylogenetic signals, incomplete lineage sorting, or complex evolutionary issues. Moreover, the selected loci unfortunately have not provided sufficient phylogenetic resolution at the species level for Lilium. For the conservation, utilization, and domestication of Lilium plants, more effective molecular markers are needed to identify Lilium species and evaluate the population genetics and breeding for the Lilium genus. DNA barcoding can be used to elucidate plant relationships at the species level; therefore, the identification of high-resolution molecular markers at the species level is critical to the success of DNA barcoding in plants 21 .
Chloroplast (cp) is the key organelle for photosynthesis and carbon fixation in green plants 22 , and therefore, their genomes could provide valuable information for taxonomic classification and the reconstruction of phylogeny because of sequence divergence among plant species and individuals 23 . Due to their maternal inheritance, very low recombination and haploidy, cp genomes are helpful for tracing source populations and phylogenetic studies of land plants for resolving complex evolutionary relationships [24][25][26] . Typical cp genomes in angiosperms have a generally conserved quadripartite circular structure with two copies of inverted repeat (IR) regions that are separated by a large single copy (LSC) region and a small single copy (SSC) region 27,28 . These genomes with sizes in the range of 120-170 kb typically encode 120-130 genes.
The use of whole chloroplast genomes as a universal barcode and the existence of variable characters among the chloroplast genomes at the species level have recently been demonstrated, helping to overcome the previously low resolution in plant relationships 17,[20][21][22][23][29][30][31][32][33] . With the rapid development of next-generation sequencing, it is now more convenient and relatively inexpensive to obtain cp genome sequences and extend gene-based phylogenetics to phylogenomics.
In this study, we present the complete chloroplast genomes of nine Lilium species through NGS sequencing and add seven species from GenBank [34][35][36][37][38][39][40] . We then test the feasibility of phylogeny reconstruction using the chloroplast genome. We further perform an analysis to gain insights into the overall evolutionary dynamics of chloroplast genomes in Lilium.

Results
Genome sequencing and assembly. Using the Illumina HiSeq 4000 system, nine Lilium taxa were sequenced to produce 3,719,304-11,167,835 paired-end raw reads (150 bp in average read length). Lilium cp genomes were de novo assembled using SPAdes 3.6.1. After these paired-end reads were screened through alignment with the chloroplast genome using Geneious V9, 47,505 to 988,478 cp genome reads were extracted with 46 X to 971 X coverage ( Table 1). The four junction regions in each genome were validated by PCR-based sequencing according to Dong et al. 41 .
Complete chloroplast genomes of Lilium species. The nucleotide sequences of the 16 Lilium cp genomes range from 151,655 bp (L. bakerianum) to 153,235 bp (L. fargesii; Fig. 1, Table 2). The Chloroplast genomes assembled in single circular, double-stranded DNA sequences, displaying a typical quadripartite structure, consisting of a pair of IRs (26,394-26, Table S1). All four rRNA genes are duplicated in the IR region. Fifteen distinct genes contain one intron, two of which contain two introns (clpP and ycf3). The rps12 gene is a trans-spliced gene with the 5′ end located in the LSC region and the duplicated 3′ end in the IR region, as has been reported previously in other plants.
Simple Sequence Repeats (SSR) analysis of the Lilium cp genome. We used MISA to detect the SSR sites of all 16 chloroplast genomes. The number of SSRs in chloroplast genomes differed among the sixteen Lilium species, as shown in Table 2. The number of SSRs varied from 53 to 78. The most abundant were mononucleotide repeats, which accounted for approximately 56.38% of the total SSRs, followed by dinucleotides and tetranucleotides (Table S2). Hexanucleotides are very rare across the cp genomes.

Species
Raw data no.   In Lilium, all mononucleotides (100%) are composed of A/T, and a similar majority of dinucleotides (70.31%) are composed of A/T (Fig. 2). Our findings are comparable to previously reported findings that chloroplast genome SSRs are composed of polyadenine (polyA) or polythymine (polyT) repeats and rarely contained tandem guanine (G) or cytosine (C) repeats. Most of those SSRs are located in the LSC and SSC regions. In general, the SSRs of the sixteen Lilium species represent abundant variation and can be used in combination with nuclear SSRs developed in the genus for conservation or reintroduction, species biodiversity assessments and phylogenetic studies of Lilium in native or introduced areas.
Genome sequence divergence among Lilium species. We compared nucleotide diversity in the total, LSC, SSC, and IR regions of the cp genomes. The alignment revealed high sequence similarity across the Lilium cp genomes, suggesting that they are highly conserved. In total, 3,182 variable sites (2.03%), including 1,449 parsimony-informative sites in the total cp genomes were found (0.93%; Table 3). Among these regions, IR regions exhibite the least nucleotide diversity (0.00093) and SSC higher divergence (0.00839).

Phylogenetic analysis.
In the present study, five datasets (whole complete cp genome sequences, LSC, SSC, IR and ten combined variable regions) from cp genomes of sixteen Lilium and four outgroups as well as Smilax china were used to perform phylogenetic analysis. Using MP, ML and MrBayes analyses, phylogenetic trees were constructed based on five datasets (Figs 4 and 5, Fig. S1). The topologies based on the three methods of analysis were highly concordant in each dataset, as well as with the results of Rønsted et al. 41 and the phylogenetic trees had moderate to high support, except for the IR dataset, which received poor support. In addition, Fritillaria species or added Smilax china were used as the outgroup. The results showed that different outgroups could not influence the ingroup topology in our research (Figs 4 and 5, Fig. S1).
The sixteen Lilium species were grouped into two branches (Fig. 4). All the datasets indicated that two sect.

Discussion
Chloroplast genome evolution in Lilium. In this study, nine new chloroplast genome sequences of Lilium were sequenced using the Illumina HiSeq platform. The complete cp genomes range from 151,655 to 153,235 bp, which is within the range of cp genomes from other angiosperms 42 . The cp genomes of Lilium are highly conserved, with identical gene content and gene order and genomic structure comprising four parts. Such a low GC content has also been found in other angiosperm chloroplast genomes 43 . Through a comparative analysis of Lilium cp genome sequences, we rapidly developed molecular markers such as single nucleotide polymorphism (SNPs), and SSRs a type of 1-7 nucleotide unit tandem repeat sequence frequently observed in cp genomes, have been shown to have significant potential applications. SSRs are These markers are widely used in population genetics and breeding program studies 44,45 because of their high polymorphism even within species, due to slipped-strand mispairing on a single DNA strand during DNA replication 46 . In this study, 1,043 SSRs were identified in sixteen Lilium cp genomes. The most abundant are mononucleotide repeats, accounting for more than 56.38% of the total SSRs, followed by the di-, tri-, tetra-, and pentanucleotides. These new resources will be potentially useful for population studies in the Lilium genus, possibly in combination with other informative nuclear genome SSRs.
The nucleotide substitution rate is a central question in molecular evolution 47 . Based on the number and distribution of SNP and proportions of variability, the sequence divergence of the IR region is lower than that in LSC and SSC regions, also occurring in many previously reported plants 30,48 . All pairwise sequence comparisons in our study reveal that DNA sequences evolve at different rates in different species. This result has also been found in other taxa 49 .
Because Lilium contains more than 100 species, its DNA barcoding and taxonomy are difficult to assess. The rbcL, matK, trnH-psbA, and ITS genes have been widely used to investigate taxonomy and DNA barcoding at the interspecific level (China Plant BOL Group 2014). In DNA barcoding or molecular phylogenetic studies of Lilium, these markers had extremely low discriminatory power [11][12][13] . The indel and SNP mutation events in the genome were not random but clustered as "hotspots. " Such mutational dynamics created the highly variable regions in the   Table 4. Ten regions of highly variable sequences of Lilium.  genome 30 . Therefore, based on our study, the largest sequence divergence regions are trnS-trnG, trnE-trnT-psbD, trnF-ndhJ, psbE-petL, trnP-psaJ-rpl33, psbB-psbH, petD-rpoA, ndhF-rpl32-trnL, ycf1a, and ycf1b. Regions ycf1a and ycf1b are particularly highly variable among Lilium, and they have been added as a core plant DNA barcode 22 .
The trnE-trnT-psbD, trnS-trnG and ndhF-rpl32-trnL regions have been widely used for phylogenetic studies 20,50 . Two rarely reported highly variable regions, psbB-psbH and petD-rpoA, present in the Lilium cp genome were identified in the present study.
Inferring the phylogeny with chloroplast phylogenomics in Lilium. Phylogenetic analyses based on complete plastid genome sequences have provided valuable insights into relationships among and within plant genera. Early studies have been conducted to position uncertain families in angiosperms, such as Amborellaceae 51 , Nymphaeaceae 52 , and Nelumbonaceae 53 . With the recent advent of NGS technology, chloroplast genomes can be sequenced quickly and cheaply, and they have been successfully used to address various phylogenetic questions at the family and even at the species level [54][55][56] .
In this study, different datasets produced similar topological structures, except the IR dataset, possibly because IR is more conserved and provides fewer information sites than those found in SC regions (Table S2). All trees based on the datasets (except the IR dataset) were not only coincident with the previous phylogenetic studies based on ITS sequences 15,16 or the commonly used chloroplast genes such as matK, rbcL, atpB and atpF-H 4, 12, 57 but also had higher bootstrap values and resolution, especially at low classification levels. For example, the Martagon clade, including L. sp., L. tsingtauense and L. hansonii, received a higher robust support in the dataset (cp genome, LSC, SSC and ten variable regions) than the clade based on other markers (Figs 4 and 5; Fig. S1). Furthermore, species of L. sp. and L. tsingtauense were found to form a robustly supported clade ([ML] Bootstrap = 99, [MP] Bootstrap = 100 and PP = 1), suggesting that the two species are likely the same. Phylogenetic trees based on the cp genome, LSC, SSC and ten variable regions datasets support (with low support) that L. distichum (from GenBank) form a clade with the clade of L. bakerianum and L. nepalense var. ochraceum or the clade of L. fargesii and L. duchartrei (Figs 4 and 5; Fig. S1). However, L. distichum possesses a whorled leaf and is attributed to the sect. Martagon. Therefore, the species identification of L. distichum from GenBank may be inaccurate. However, evolutionary relationships and divisions within species/section need further investigation.
This study used the cp genome data to infer the phylogenetic relationships in Lilium, providing genome-scale support. The cp genome is expected to be useful in resolving the deeper branches of the phylogeny as more whole-genome sequences become available in Lilium.

Methods
Plant material and DNA extraction. Fresh leaves of nine Lilium species were sampled (Table 5).
Specimens were deposited in the herbarium of the Institute of Botany, Chinese Academy of Sciences (PE) ( Table 5). Total genomic DNA was extracted using a plant genome extraction kit (Tiangen, Beijing, China). Subsequently, DNA concentration was measured using a NanoDrop spectrophotometer 2000 (Thermo Fisher Scientific, America). Genome sequencing, assembly and annotation. DNA was sheared to construct a 400 bp (insert size) paired-end library in accordance with the Illumina HiSeq 4000 standard protocol. The paired-end reads were qualitatively assessed and assembled using SPAdes 3.6.1 58 . The gaps were filled by PCR amplification and Sanger sequencing. Sanger sequence reads were proofread and assembled with Sequencher 4.10 (http://www.genecodes. com).
All genes encoding proteins, transfer RNAs (tRNAs), and ribosomal RNAs (rRNAs) were annotated on Lilium. Plastomes were annotated using Dual Organellar Genome Annotator (DOGMA) software and the tRNAscan-SE 1.21 program 59,60 . Initial annotation, putative starts, stops, and intron positions were determined by comparison with homologous genes in other Lilium cp genomes.
Microsatellite analysis. Perl script MISA 61 was used to detect microsatellites (mono-, di-, tri-, tetra-, penta-, and hexanucleotide repeats) with the following thresholds (unit size, min repeats): ten repeat units for mononucleotide SSRs, five repeat units for dinucleotide SSRs, four repeat units for trinucleotide SSRs, and three repeat units each for tetra-, penta-, and hexanucleotide SSRs.