The complete chloroplast genome sequences of five pinnate-leaved Primula species and phylogenetic analyses

The six pinnate-leaved species are a very particular group in the genus Primula. In the present paper, we sequenced, assembled and annotated the chloroplast genomes of five of them (P. cicutarrifolia, P. hubeiensis, P. jiugongshanensis, P. merrilliana, P. ranunculoides). The five chloroplast genomes ranged from ~ 150 to 152 kb, containing 113 genes (four ribosomal RNA genes, 29 tRNA genes and 80 protein-coding genes). The six pinnate-leaved species exhibited synteny of gene order and possessed similar IR boundary regions in chloroplast genomes. The gene accD was pseudogenized in P. filchnerae. In the chloroplast genomes of the six pinnate-leaved Primula species, SSRs, repeating sequences and divergence hotspots were identified; ycf1 and trnH-psbA were the most variable markers among CDSs and noncoding sequences, respectively. Phylogenetic analyses showed that the six Primula species were separated into two distant clades: one was formed by P. filchnerae and P. sinensis and the other clade was consisting of two subclades, one formed by P. hubeiensis and P. ranunculoides, the other by P. merrilliana, P. cicutarrifolia and P. jiugongshanensis. P. hubeiensis was closely related with P. ranunculoides and therefore it should be placed into Sect. Ranunculoides. P. cicutarrifolia did not group first with P. ranunculoides but with P. merrilliana, although the former two were once united in one species, our results supported the separation of P. ranunculoides from P. cicutarrifolia as one distinct species.

Repeat element analysis. Simple sequence repeats (SSR) were detected for the cp genomes using MISA 29 .
Sequence divergence analysis. We calculated the nucleotide variability (Pi) values of the protein coding sequences, introns and intergenic spacers of the cp genomes of the six species with pinnatisect leaves and 16 other Primula species available in Genbank (Table S1) using DnaSP 6.12 31 . Phylogenetic analysis. We constructed phylogenetic trees by Maximum likelihood (ML) and neighborjoining (NJ) methods. Androsace paxiana and Lysimachia congestiflora were treated as outgroups. The ML phylogenetic trees (1000 bootstrap replicates) were inferred with RAxML 8.2.10 32 based on whole cp genomes (Tables S1), ycf1, and concatenation of ITS, matK and rbcL (Tables S2), respectively. ycf1 was extracted from the cp genomes (Tables S1), because the gene ycf1 did not exist in P. tsiangii, we used its homologous part in the cp genome. NJ analysis of 71 Primula species including the six species with pinnatisect leaves based on concatenation of ITS, matK and rbcL (Tables S2) was carried out using MEGA-X 33 with 1000 bootstrap replicates. ITS of the six pinnate-leaved Primula species was newly sequenced in this study; matK and rbcL were derived from the chloroplast sequences in this study and from the cp genome of P. filchnerae 21 . The accessions of the cp genomes and DNA fragments were listed in Tables S1, S2, respectively.
The sequencing coverage of our five newly assembled cp genomes was from 923 to 6237 ( Figure S1). The six cp genomes possessed typical quadripartite structure: IRa, IRb, LSC and SSC (Table 1), and they exhibited the same gene order, no gene rearrangement or inversion occurred ( Figure S2). The physical map of the cp genome of P. hubeiensis was shown in Fig. 1. The GC content was ~ 37%. The newly sequenced genomes ranged from 150,187 bp to 151,972 bp, harboring 113 genes: four ribosomal RNA genes, 29 tRNA genes and 80 protein-coding genes, and among them 14 genes was duplicated in IRa and IRb (Table 1). Due to presence of multiple stop codons, the gene infA was pseudogenized in the five newly sequenced species. The open reading frame (ORF) in accD of P. filchnerae (MK888698) was truncated to be only 1305 bp compared with 1455 or 1464 bp ORF of other five species. Lee et al. 34 identified five conserved amino acid sequence motifs in accD gene. Conserved amino acid sequence motifs IV and V were absent in accD of P. filchnerae. Therefore, accD was nonfunctional in P. filchnerae.

SSRs and repeats.
Five categories of SSRs were identified for the six species ( . Most or all mono-repeats were A/T repeats including 10 to 16 nucleotides. The number of repeat units ranged from five to eight for dinucleotide repeats. The tri-and penta-nucleotide SSRs consisted of four motifs, and tetra-nucleotide SSRs of four to five repeat units. Except the largest repeat for each genome (i.e. IRs), a total of 183 repeat pairs (three types: forward (F), reverse (R), and palindromic repeats (P)) were detected in the six genomes ( Fig. 2), which ranged from 30 to 137 bp in length. Palindromic repeats were the most common, accounting for 55.2% (101 of 183), followed by forward repeats (44.3%, 81 of 183). No complement repeats were identified in all species and one pair of reverse repeats existed specifically in P. ranunculoides. In the six species, 96.7% (177 of 183 repeat pairs) repeats were 30-59 bp in length, consistent with the length reported in other Primula species 20 . The longest repeat (137 bp) was found in P. cicutariifolia, and this species contained the most repeats (44 pairs), while P. jiugongshanensis had the least (24 pairs).
IR/SC boundary. The IR/SC boundary regions of the six Primula cp genomes were compared, and the IR/ SC junction regions showed slight differences in the length of organization genes flanking the junctions or the distance between the junctions and the organization genes (Fig. 3). The genes spanning or flanking the junction of LSC/IRb, IRb/SSC, SSC/IRa and IRa/LSC were rps19/rpl2, ndhF, ycf1, rpl2/trnH, respectively. IR expansion and contraction was observed. P. cicutarrifolia had the smallest size of IR but largest size of both LSC and SSC; though largest size of IR was detected in P. filchnerae, the LSC or SSC was not the smallest in this species. The gene trnH was located in LSC, 0-24 bp away from the IRa/LSC border. The largest extensions of ycf1 into both SSC and IRa occurred in P. filchnerae (4566 bp and 1023 bp, respectively) and ycf1 of P. filchnerae were the longest among the six species. The gene ndhF was utterly situated in SSC and 108 bp distant from the IRb/SSC junction in P. cicutarrifolia; in the rest five species the fragment size of ndhF in SSC was largest in P. hubeiensis (2194 bp). In P. cicutarrifolia, P. jiugongshanensis and P. merrilliana, rps19 and rpl2 were located in the upstream and downstream of the LSC/IRb junction, respectively; rps19 ran across the LSC/IRb junction in P. filchnerae, P. hubeiensis, P. ranunculoides with 161, 62, 56 bp extension in IRb, respectively. Divergent hotspots in the Primula chloroplast genome. As indicated by the value of Pi, the nucleotide variability of the 22 Primula species (Table S1) was evaluated by DnaSP 6.12 31  Phylogenetic analysis. The ML tree of 22 Primula species was constructed with RAxML 32 (Fig. 4), based on the whole cp genomes. The six pinnate-leaved Primula species did not form a monophyly, but separated into two distant clades. P. filchnerae grouped with P. sinensis, the other five species clustered together and constituted the clade Sect. Ranunculoides with 100% bootstrap. In the ML tree, Sect. Proliferae exhibited monophyly, while species of Sect. Crystallophlomis separated into different clades.
The topology of the ML tree based on ycf1 ( Figure S3) was consistent with that based on whole cp genomes (Fig. 4), except that the clade formed by P. veris and P. knuthiana were sister to the clade consisting of Sects. We also constructed both ML and NJ tree of 71 Primula species based on the concatenation of three common barcoding markers (ITS, matK and rbcL). Only the results of NJ analysis (Fig. 5) showed consistency with those of Yan et al. 12 , Liu et al. 35 , and ML analysis based on whole cp genomes (Fig. 4). The six pinnate-leaved Primula species were separated into two distantly related groups. The clade consisting of P. filchnerae and P. sinensis (Sect. Auganthus) was sister to the clade formed by Sects. Carolinella, Obconicolisteri, Monocarpicae, Cortusoides, Malvacea, Pycnoloba. The five pinnatisect-leaved species P. cicutarrifolia, P. hubeiensis, P. jiugonshanensis, P. merrilliana and P. ranunculoides (Sect. Ranunculoides) comprised a 100% supported clade, which was sister to the group containing Sects. Crystallophlomis, Petiolares, Proliferae, Amethystina. Sect. Carolinella and Sect. Crystallophlomis, and Sect. Malvacea were polyphyletic. www.nature.com/scientificreports/

Discussion
The six cp genomes of pinnate-leaved species were ~ 150-152 kb with similar GC content. The gene content and organization were similar and a high degree of synteny in gene order was observed across all the genomes. The gene accD was normal in five species but perhaps pseudogenized duo to lack of two conserved amino acid sequence motifs in P. filchnerae. In P. sinensis, this gene was pseudogenized and another copy of accD were detected in the nucleus 36 . Whether accD was functionally transferred to the nucleus in P. filchnerae needs further confirmation. Interestingly, P. filchnerae and P. sinensis always grouped together on the phylogenetic trees (Figs. 4, 5).
In the six Primula species, the IR/SC boundary regions exhibited similar feature, with slight differences observed in the length of organization genes flanking the junctions or the distance between the junctions and the organization genes (Fig. 3), and the situation is similar to ten other Primula species 20 , which indicates the structural conservation of Primula. Expansion of IR regions may cause size increase in chloroplast genomes 37 , however, it seems that the size of whole cp genomes did not always increase with expansion of IR in Primula. For example, among the six pinnate-leaved species, P. cicutarrifolia possessed the smallest IR (25,094 bp) but the largest whole genome size (151,972 bp); in P. filchnerae, the IR was the longest (25,568 bp), and the whole genome size (151,547 bp ) was only bigger than P. ranunculoides (150,187 bp). In P. kwangtungensis 20 , both IR (25,855 bp) and the whole genome size (153,757 bp) exceeded all other species (Table S1) including the six pinnate-leaved species.  /T  27  49  29  27  42  29   C/G  1  ---1  1   Di  AT/AT  8  5  9  9  8  8   Tri  AAG/CTT  2  --2  2  -Tetra   AAAT/ATTT  2  2  2  2  4  1 AACG/CGTT 2 - www.nature.com/scientificreports/ Except the IRs, 183 pairs of repeats were detected in the six cp genomes, only one which were longer than 70 bp (137 bp), which is similar in ten other Primula species, most of repeats ranged in size from 14 to 62 bp and all (except one pair of 111 bp repeat) were not large repeats (> 100 bp) 20 . No rearrangement was found in our six species, the reason may be lack of large complex repeating sequences (> 100 bp) just as suggested by Ren et al. 20 . The SSR marker analyses have been proven to be powerful to assess the genetic diversity and population structure of P. cicutarrifolia, P. merrilliana and P. sikkimensis 7,38,39 . The usefulness of the SSRs located in the six chloroplast genomes may be tried in future studies on population genetics of Primula species.
Using the six pinnate-leaved species cp genomes and 16 other Primula cp genomes available in NCBI, the divergence hotspots were identified among CDSs and noncoding regions. The nucleotide diversity (Pi) of ycf1 and matK reached 0.05036 and 0.04878, respectively, much higher than rbcL (0.02149), which was a common barcode for species identification. The gene ycf1 was considered to be the most promising barcode to identify plant species 40 . Two chloroplast genes, ycf1 and psbM-psbD, had much better discriminatory power (both 87.5%) than did other chloroplast barcodes for identifying Fritillaria species 41 . The ML species tree based on ycf1 (Figure S3) showed similar topology to that based on whole cp genome. Except matk CDS, other hotspots regions identified here were not tested for species identification or phylogeny reconstruction 42,43 . Among the noncoding sequences, trnH (GUG) -psbA was the most variable one, which showed better discriminatory power than matK and rbcL 43 . These highly variable regions have the potential to be used for Primula species discrimination or phylogeny investigation in future study.  Both ML and NJ phylogenetic analyses revealed that the six pinnate-leaved Primula species did not form a monophyletic group, probably due to parallel evolution of pinnately lobed or divided leaves 10 . In the ML and NJ trees, the phylogenetic placement of the clade consisting of P. filchnerae and P. sinensis was near to Sect. Carolinella and Sect. Obconicolisteri, which is similar to the results of Yan et al. 12 . Liu et al. 35 proposed Subgen. Auganthus (Sect. Auganthus, Bullatae, Cortusoides, Dryadifolia, Malvacea, Monocarpicae, Obconicolisteri, Pycnoloba) to include Subgen. Carolinella (Sects. Carolinella) and exclude P. aromatica, P. filchnerae and P. sinensis thus were in the basal clade of Subgen. Auganthus. The close relatedness of P. filchnerae and P. sinensis was also indicated by the pseudogenization of the gene accD. And our study showed that Sect. Ranunculoides (P. cicutarrifolia, P. hubeiensis, P. jiugongshanensis, P. merrilliana, P. ranunculoides) was closely related to Sects. Crystallophlomis. Li et al. 5 conjectured that P. hubeiensis resembled P. filchnerae, and might belong in Sect. Auganthus. However, the present study clearly indicated that P. hubeiensis grouped with P. ranunculoides first, P. cicutarrifolia grouped with P. merrilliana first, and therefore P. hubeiensis should be placed in Sect. Ranunculoides. He et al. 4 also demonstrated that P. cicutarrifolia was closely related with P. merrilliana. P. ranunculoides and P. cicutarrifolia were united into one species 2 but later separated as two species 3 , and our ML and NJ analyses strongly supported the taxonomic treatment of Shao et al. 3 .

Data availability
The complete chloroplast sequences generated and analyzed in this paper are available in GenBank (https :// www.ncbi.nlm.nih.gov/genba nk/, accession numbers listed in the text), the raw reads deposited in Genbank are SRR12179774-SRR12179778.