Chloroplast genomes elucidate diversity, phylogeny, and taxonomy of Pulsatilla (Ranunculaceae)

Pulsatilla (Ranunculaceae) consists of about 40 species, and many of them have horticultural and/or medicinal value. However, it is difficult to recognize and identify wild Pulsatilla species. Universal molecular markers have been used to identify these species, but insufficient phylogenetic signal was available. Here, we compared the complete chloroplast genomes of seven Pulsatilla species. The chloroplast genomes of Pulsatilla were very similar and their length ranges from 161,501 to 162,669 bp. Eight highly variable regions and potential sources of molecular markers such as simple sequence repeats, large repeat sequences, and single nucleotide polymorphisms were identified, which are valuable for studies of infra- and inter-specific genetic diversity. The SNP number differentiating any two Pulsatilla chloroplast genomes ranged from 112 to 1214, and provided sufficient data for species delimitation. Phylogenetic trees based on different data sets were consistent with one another, with the IR, SSC regions and the barcode combination rbcL + matK + trnH-psbA produced slightly different results. Phylogenetic relationships within Pulsatilla were certainly resolved using the complete cp genome sequences. Overall, this study provides plentiful chloroplast genomic resources, which will be helpful to identify members of this taxonomically challenging group in further investigation.

In Europe, some species of Pulsatilla are rare, endangered and endemic. Those taxa are protected due to their small populations and disappearing localities, and those species have been placed on the Red Lists of Endangered Species 20 .
Taxonomically, Pulsatilla is an especially complex and challenging group. In all treatments published before, three subgenera have been recognized: subgenus Kostyczewianae (only one species), subgenus Preonanthus, and the largest subgenus Pulsatilla. However, the intragenic morphological variability of Pulsatilla was especially complicated 12 . The recognition and identification of wild Pulsatilla species is particularly difficult based on traditional approaches.
Molecular markers are significant to explore the phylogenetic relationships of the genus Pulsatilla. Phylogenetic relationships between Pulsatilla and closely related genera have been dedicated during the past years [21][22][23][24][25] . Previous studies have attempted to identify these species among Pulsatilla with universal molecular markers, but the species resolution was relatively low 15 .
In this study, we present seven complete cp genomes from two subgenera of Pulsatilla obtained through next-generation sequencing (NGS) and genomic comparative analyses with four previously published cp genome sequences of Pulsatilla from NCBI, with Anemoclema glaucifolium as the outgroup. We identify microsatellites (SSRs), larger repeat sequences, and highly variable regions, with the aim of developing DNA barcodes and testing the feasibility of phylogenetic analyses of Pulsatilla using the chloroplast genome.

Results and discussion
Chloroplast genome features. We have obtained 1.95 Gb of average NGS clean data for each species, with minimum and maximum values of 1.14 Gb (P. dahurica), and 3.56 Gb (Pulsatilla alpina), respectively. The read number for each species ranged from 6,468,944 (P. dahurica) to 15 Table 2). Previous studies of other angiosperm groups have found that chloroplast genomes are conserved 28 or highly polymorphic 29,30 . These genomes which we reported are highly conserved in gene order, gene content and intron number, which is in Table 1. Voucher information and GenBank accession numbers for Pulsatilla and outgroups. Species with asterisks were collected by this study, whereas others were obtained from Genbank. NA not applicable. Chloroplast genome comparison. In most angiosperms, the IR regions of cp genomes of angiosperms are highly conserved, but the expansion and contraction of IR region boundaries are ever present 33,34 . At the same time, several lineages of land plant chloroplast genomes show great structural rearrangement, even loss of IR regions or some gene families 35 . The expansion and contraction in IRs are significant evolutionary events, because they can change gene content and chloroplast genome size 30,36 . Expansion of the IRs has been reported in Araceae 36,37 . Sometimes, the size of LSC increases and that of SSC decreases, becoming only 7000 bp in Pothos 38 . At the same time, a linear chloroplast genome was also reported in some groups, e.g. maize 35,39 . Expansion and contraction of the IR regions can also lead to duplication of certain genes or conversion of duplicate genes to single copy, respectively 30,36 . Changes in the size of the IRs can also cause rearrangement of the genes in the SSC as recently observed in Zantedeschia 36 .
The Pulsatilla chloroplast genomes were compared to previously published data and showed typical Anemoneae (Ranunculaceae) genome structure 37,39 . As reported for Anemoclema, Anemone, Clematis and Hepatica, the www.nature.com/scientificreports/ IR regions of genus Pulsatilla are roughly 4.4 kb longer than those of other genera of the family Ranunculaceae, such as Aconitum, Coptis, Thalictrum, Megaleranthis, Ranunculus, and Trollius 37,39 . The gene orders located within the IR-SSC and IR-LSC boundaries are similar among tribe Anemoneae but different from those of other genera of Ranunculaceae (Fig. 2, Fig. S1). We compared the IR/SC boundary regions of Pulsatilla, and the junction positions are very similar and conserved within genus Pulsatilla. In the four boundary regions (LSC, IRa, SSC, IRb) of seven Pulsatilla cp genomes, the LSC/IRa and IRb/LSC border was in the intergenic region, and the adjacent genes is rps36, rps8 and rps4, respectively. The genes ycf1 andψycf1 have crossed the SSC/IRb and IRa/SSC boundary, respectively, which was also found in Monsteroideae (Araceae) 28 . The pseudogene ycf1 has been found in other groups 30,36 . The IR regions were highly conserved, with nucleotide diversity values in those regions less than 2%.
Sequence divergence. Multiple alignments of plastid genomes were performed to investigate levels of genome divergence. Based on MAFFT analysis, there are three inversions in LSC of Pulsatilla, same as the tribe Anemoneae ( Fig. 1, Fig. S1) 37 . The mVISTA analysis has revealed high sequence similarity across the coding region and there exists more variability in non-coding regions. Sequence identity among the seven species was 96.68-98.66%. The number of nucleotide substitutions and sequence distance (Pi) were the highest (1214, 0.0063) between P. alpine and P. dahurica, with the lowest (112, 0.0005) between P. vernalis and P. patens ( Fig. 3; Table 3).

Identification of highly variable regions.
Chloroplast genome markers, especially several universal chloroplast regions, have been widely used in plant systematics and identification at multiple taxonomic levels. Highly suitable polymorphic chloroplast loci have been identified and designed as unique markers in different groups 28,40 . However, relationships within the genus Pulsatilla have not been well resolved because of the low www.nature.com/scientificreports/ polymorphism of these universal markers 15 . In order to facilitate identification of closely related species of Pulsatilla, we sought to identify highly variable regions of the chloroplast genome, as previously described 9,27,41-44 .
As a result, we identified nine divergent hotspot regions, including six intergenic spacer regions (rps4-rps16, rps16-matK, ndhC-trnV, psbE-petL, ndhD-ccsA, ccsA-ndhF) and four protein-coding regions (ycf1, ndhF, ndhI) ( Fig. 4; Table 4). Most commonly employed loci, e.g. trnL-trnF, trnH-psbA were not selected in our finding. The nine highly variable regions included 684 variable sites, including 181 indels. However, these indels are not suitable for the phylogenetic inference because Maximum likelihood model used only substitutions not indels 28 . Their nucleotide diversity values ranged from 0.00802 to 0.02212. The region of ccsA-ndhF showed the highest variability, the next most variable regions were rps4-rps16, ndhC-trnV, and psbE-petL. The diversity level of two protein-coding regions (ycf1, ndhF) was the lowest. Among the nine divergent hotspot regions, the ndhI is difficult to align. There are large numbers of indels in ndhI and the intergenic spacer between ndhI and ndhG, these regions were not considered suitable for the phylogenetic inference of the Pulsatilla. Thus, we selected eight regions, four (rps4-rps16, rps16-matK, ndhC-trnV, psbE-petL) in the LSC and four (ndhD-ccsA, ccsA-ndhF, ycf1, ndhF) in the SSC, with relatively high variability as potential molecular markers for the study of species identification and phylogeny in Pulsatilla. Five hotpots were found in chloroplast genome of Veroniceae (Plantaginaceae), and two universal marker, trnH-psbA and matK were identified, respectively 45,46 . Ten highly variable regions were selected as potential molecular markers for Fritillaria, including ycf1 44,47 , which was also selected in this study. Sequences of these variable regions founded in this study could be regarded as potential molecular markers for species identification and evolutionary studies and have been shown to be valuable for studies in other groups (e.g., Fritillaria) 44 .
SSRs and large repeat sequences. Oligonucleotide repeats play an important role for generating indels, inversion and substitutions 29 . Repeat sequences in the chloroplast genome could provide valuable information for understanding not only the sequence divergence but the evolutionary history of the plants [48][49][50] . We have detected five types of large repeats (forward, reverse, palindromic, complement and tandem repeats) in the seven Pulsatilla cp genomes. Among them, the most common repeat types are forward and palindromic repeats, followed by reverse repeats, and only little complement repeats were found in Pulsatilla cp genomes (Fig. 5A). Most of the repeats were short, ranging from 30-49 bp (Fig. 5B).
We also identified multiple microsatellite repeats, also known as simple sequence repeats (SSR) or short tandem repeats (STR) 49 . Due to their codominant inheritance and high variability, SSRs are robust and effective www.nature.com/scientificreports/ markers for species identification and population genetic analyses [49][50][51][52][53] . Most of the mononucleotide repeats were composed of A/T. The other microsatellites types were also dominated by AT/TA, with very little G/C (Fig. 5C). In this study, plentiful microsatellite loci were found through the comparative analysis of Pulsatilla cp genome sequences. In total, we detected six types of microsatellite (mononucleotide, dinucleotide, trinucleotide, tetranucleotide, pentanucleotide and hexanucleotide repeats) based on the comparison of seven Pulsatilla cp genomes (Fig. 5D). Each Pulsatilla cp genome had 69-87 microsatellites. The lengths of repeat motifs of these microsatellites ranged from 10 to 21 bp (Fig. 5E). Among the four structural regions in the cp genomes, most of the repeats and microsatellites were distributed in LSC, followed by SSC, and fewest in IRa/IRb (Fig. 5F), which were also reported in other studies in angiosperms 29,30 . These SSRs and repeat sequences are uncorrelated with genome size and phylogenetic position of the species 36 , but will provide important information for further studies of phylogenetic reconstruction and infra-and inter-specific genetic diversity 54,55 . Phylogenetic analyses. Chloroplast genomes have been widely used and have made significant contributions to phylogeny reconstruction at different taxonomic levels in plants 7,9,24,35 . To better clarify the evolutionary  www.nature.com/scientificreports/ relationships within Pulsatilla, we used each data set to construct phylogenetic trees using the ML analytical methods. We also construct phylogenetic trees with those eight highly variable regions using the ML, MP analytical methods. All tree topology structures were identical. Therefore, here we presented the phylogenetic studies using the ML tree with the support values from the MP analyses recorded at the corresponding nodes (Fig. 6). The phylogenetic tree based on all data sets (except the IR and SSC regions) from the complete plastid genome sequences yielded the same topology. The phylogenetic tree based on chloroplast genome differed from that of  www.nature.com/scientificreports/ the DNA barcode combination rbcL + matK + trnH-psbA, but with higher support values. The phylogenetic trees based on data from complete plastid genome sequences showed that the species of Pulsatilla formed a monophyletic group which in turn includes two strongly supported (bootstrap = 100) clades. One clade comprised P. alpina and P. occidentalis, members of subg. Preonanthus. The other comprised two subclades: (1) members of P. hirsutissima, P. ludoviciana, P. multifidi, P. patens and P. vernalis, and (2) species of P. chinensis, P. dahurica, P. grandis and P. pratensis. All the species of the two subclades are members of the subg. Pulsatilla. These results were congruent with our former results based on universal markers 15 .
In phylogenetic analyses, compared to the combination of barcodes, the full chloroplast genome sequence data formed distinct clades with high bootstrap support, improving the inadequate resolution of barcodes combination. The LSC regions and coding regions have the same topology structures with robust support. However, sequencing of genomic DNA is still expensive. It is necessary to utilize variation within chloroplast regions for rapid species-specific assay 5,9,31,33,42 . Here we found that phylogenetic inference based on highly variable regions yielded a tree with the same topology as the one recovered based on complete chloroplast genome sequences, demonstrating the high utility of hotspots of variability for species identification and phylogenetic analysis. More samples and laboratory works are needed in the future to increase the number of these variable regions available for study.

Conclusions
In this study, we generated complete chloroplast genomes of seven species of Pulsatilla and compared them to four previously published cp genome sequences of Pulsatilla. Chloroplast genomes of Pulsatilla share many features with those of other angiosperms. Informative differences between cp genomes of Pulsatilla were most evident in inversions in the large single copy region and expansion of the inverted repeat region. We identified multiple potentially valuable genetic markers, including large repeat sequences, numerous SSRs, and eight highly variable regions. Genetic markers have provided a reference for the improvement of plants fingerprints and the identification of similar Pulsatilla. In addition, it will be better to construct the phylogenetic relationships of Pulsatilla species using these highly variable regions and genetic markers in the future. Totally, this study provides a basis for future studies of horticultural cultivation, conservation, population genetics, phyletic evolution, development of DNA barcodes, and diverse research in Pulsatilla.

Materials and methods
Plant material, DNA extraction and sequencing. Fresh leaves of P. dahurica were collected from Jilin province of China and dried with silica gel. Dry leaves of other six Pulsatilla species were taken from herbarium specimens. We extracted total genomic DNA with the DNeasy Plant mini kits (QIAGEN, Guangzhou, China). The genomic DNA was sequenced using the Illumina Miseq platform (Illumina, San Diego, CA, USA).
Chloroplast genome assembly and annotation. Whole chloroplast genome sequencing was done for the seven species of Pulsatilla. For each species, high-quality Illumina sequencing reads were assembled into scaffolds with de novo sequence assembly software Spades, SOAPdenovo and CLC Genomics Workbench v.6.5 (CLC Bio), respectively 56 . We checked the contigs against the reference genome of P. chinensis (MG001341), using BLAST (https ://blast .ncbi.nlm.nih.gov/) and oriented aligned contigs according to the reference genome. We mapped all the raw reads back to assembled sequences to check the assembly and then constructed the complete cp genomes using Geneious v.9.0 57 . We submitted all the newly sequence data in raw format (fastq) and obtained SRA accesions ( Table 1).

Genome comparisons.
We aligned the cp genomes of Pulsatilla using multiple alignment of MAFFT v7 62 and manually edited in Geneious v.9.0. The contraction and expansion of inverted repeat regions were also examined among the seven species (excluded P. chinensis, MG001341) of the genus Pulsatilla using Irscope 63 . Then, we performed multiple alignments of the eight genomes of Pulsatilla in the mVISTA program 64 under Shuffle-LAGAN mode, with default parameters for other options, using the annotation genome of P. chinensis as a reference, with the aim of comparing and visualized the similarities and differences among different plastid genomes.
To analyse chloroplast genome organisation and gene arrangement, we perform the analyses of collinear blocks with Mauve v 2.3.1 65 plugin in Geneious v.9.0, including only one copy of the IR, assuming collinear genomes for the full alignment. Detailed gene inversions were identified by comparing the gene order of Pulsatilla samples and Anemoclema to Berberis.
To observe the plastid genome divergence and determine parsimony informative sites, we conducted sliding window analysis after alignment to determine the nucleotide diversity (Pi) of the cp genome using DnaSP v5, with 200 bp of step size and 600 bp window length 66 . We defined hotspots as those regions with a higher value of Pi. We computed the variable sites across the complete cp genomes and the sequence characteristics of hotspots by DnaSP v5.0.

Repeated sequences identification.
We identified repeat sequences, including palindromic, reverse and forward repeats, using the online software REPuter (https ://bibis erv.cebit ec.uni-biele feld.de/reput er), with the following settings: Hamming distance of 3 and minimum repeat size of 30 bp 67 . We used the online program www.nature.com/scientificreports/ Tandem Repeats Finder (https ://tande m.bu.edu/trf/trf.html) to find the tandem repeat sequences, in which the similarity percentage of two repeat copies was at least 90% and the minimal repeat size was 10 bp. The alignment parameters for match, mismatch, and indels were set at 2, 7, and 7, respectively. We identified microsatellites (SSRs) by MISA 68 with thresholds of 10, 5, 4, 3, 3, and 3 for mono-, di-, tri-, tetra-, penta-, and hexa-nucleotide, respectively.
Phylogenetic analyses. For the purpose of reconstructing the phylogenetic relationships, four published complete cp genome sequences from the genus Pulsatilla and Anemoclema were also included in our analyses. The monotypic genus Anemoclema (MH205609) was selected as the outgroup. Because molecular evolutionary rates among the different cp genome regions are diverse, analyses of phylogenetic relationships were performed based on the following seven datasets: (a) the complete cp genome sequences; (b) coding genes (CDS); (c) one inverted repeat (IR) region (IRb); (d) the large single copy region (LSC); (e) the small single copy region (SSC), (f) the consensus sequences of eight highly variable regions; and (g) the DNA barcodes combination (rbcL + matK + trnH-psbA). We applied Maximum Likelihood (ML) analysis for each of the seven datasets to construct tree-sets. Maximum Parsimony (MP) analyses were also applied for the consensus sequences of eight highly variable regions and the DNA barcodes combination. We conducted ML analyses with RAxMLHPC2 v.8.0.9 69 on the Cyberinfrastructure for Phylogenetic Research (CIPRES) Science Gateway v.3.3 70 . Then the analysis of 1000 rapid bootstrap replicates (-x) was followed by a search for the best-scoring ML tree in one program (-f a). The best-fit model for nucleotide and amino acid sequences were evaluated using jModelTest 2 71 . We applied the GTR + G model to nucleotide data for both bootstrapping and best-tree searching phases, with other parameters as the default settings. We performed the maximum Parsimony (MP) analysis on PAUP* v.4.0b10 72 . All the characters were treated as unordered and equally weighted. The heuristic search specified 1000 random sequence addition replicates with TBR branch swapping, saving only 10 trees per replicate. We obtained the strict consensus tree from all the most-parsimonious trees (MPTs) detected during the search. We calculated bootstrap percentages (BP) from 10,000 rapid bootstrap replicates, each comprising 10 random sequence addition replicates and saving only one tree per replicate.