Chloroplast genome structure in Ilex (Aquifoliaceae)

Aquifoliaceae is the largest family in the campanulid order Aquifoliales. It consists of a single genus, Ilex, the hollies, which is the largest woody dioecious genus in the angiosperms. Most species are in East Asia or South America. The taxonomy and evolutionary history remain unclear due to the lack of a robust species-level phylogeny. We produced the first complete chloroplast genomes in this family, including seven Ilex species, by Illumina sequencing of long-range PCR products and subsequent reference-guided de novo assembly. These genomes have a typical bicyclic structure with a conserved genome arrangement and moderate divergence. The total length is 157,741 bp and there is one large single-copy region (LSC) with 87,109 bp, one small single-copy with 18,436 bp, and a pair of inverted repeat regions (IR) with 52,196 bp. A total of 144 genes were identified, including 96 protein-coding genes, 40 tRNA and 8 rRNA. Thirty-four repetitive sequences were identified in Ilex pubescens, with lengths >14 bp and identity >90%, and 11 divergence hotspot regions that could be targeted for phylogenetic markers. This study will contribute to improved resolution of deep branches of the Ilex phylogeny and facilitate identification of Ilex species.

their small size, uniparental inheritance, haploid nature, and highly conserved genomic structure 16 . Species identification is a particular problem in Ilex, where there are many similar species. Moreover, since numerous copies are present in each cell, useable fragments of the chloroplast genome are more likely to persist in dried herbarium specimens 18,19 , which is an important consideration for Ilex in China where many species are only known from the type collection. In addition, chloroplast genomes have been used to improve the resolution of the backbone of phylogenies built with nuclear markers 10 .
Here, we present the complete chloroplast genomes of seven Ilex species through Illumina sequencing and reference-guided assembly of de novo contigs. We then test the feasibility of phylogeny reconstruction using chloroplast genomes in Ilex. Topologies of the phylogenies constructed from different molecular datasets are compared, including the whole cp genome, the coding regions, and LSC, SSC, IR, and introns and spacers. A new chloroplast genome for Helwingia himalaica (Yao et al., submitted), from the most closely related family, Helwingiaceae, is used as the outgroup.

Results
Output of genome sequencing and assembly. For the seven Ilex species, 239,377 to 748,662 pairedend reads (90bp in average reads length) were produced by Illumina sequencing. 233,558 to 692,191 reads were mapped to the reference genome Camellia yunnanensis (GenBank accession number KF156838), after screening these paired-end reads by aligning them to the reference genome, on average reaching over 100 × coverage of the cp genome. After de novo and reference-guided assembly, complete cp genomes of seven Ilex species were obtained. The four junction regions in each genome were validated using PCR-based sequencing (see Supplementary Table S1).
Genome features and sequence divergence. Chloroplast genomes of the seven species were assembled into single circular, double-stranded DNA sequences, presenting a typical quadripartite structure including one large single-copy region (LSC with 86,948-87,266 bp), one small single-copy region (SSC with 18,427-18,513 bp), and a pair of inverted repeat regions (IR with 52,122-52,235 bp) ( Table 1). The full length ranged from 157,610 in I. latifolia to 157,918 bp in I. wilsonii (Table 1). In I. pubescens, which was investigated in detail as an example, the chloroplast encoded a set of 144 genes of which 100 are unique genes and 16 are duplicated in the IRs regions (Fig. 1). The same gene order and clusters were found in all seven species. The 100 unique genes were composed of 74 protein-coding genes and 26 tRNA genes ( Table 2). All of the eight rRNA genes were duplicated in the IR regions (Table 1). Nine distinct genes (atpF, rpoC1, trnL-UAA, trnV-UAC, rpl2, ndhB, trnI-GAU, trnA-UGC and ndhA) contain one intron and three genes (ycf3, clpP and rps12) have two introns. Gene ycf1 in the junction region between SSC and IRb was the only pseudogene found because of the incomplete duplication of the normal copy in the junction region (Fig. 1).
AT content is rich (62.3-62.4%) and sequence identity among the seven species was 97.9%. The whole aligned sequences disclosed moderate divergence with 11 regions containing sequence similarities below 50%, especially in intergenic regions. Eleven divergent hotspot regions were identified (Fig. 2). The p-distances between Ilex and Helwingia, and among Ilex species, were 0.03988 and 0.00288, respectively, both indicating moderate genetic divergences. The p-distance between the two most closely related species, I. latifolia and I. delavayi in subgenus Ilex section Ilex, was 0.00185. Indels and repeated sequences. A total of 113 indels were detected in the Ilex species, 88 in spacers, 13 in introns of genes, and 12 in genes, with 89 in LSC, 08 in SSC, and 08 in IRb (see Supplementary Table S1). In I. pubescens we identified 34 dispersed repeats > 14 bp with sequence identify > 90%, ranging from 15 bp to 29 bp (Table 3). Most repeats were 16 bp (32.4%) or 17 bp (23.5%). A total of 23 repeats were located in intergenic regions, while 06 were in protein-coding genes and 05 in tRNA genes. Five repeats were identified in ycf2 which was the most in any gene.

IRs region.
Extensions of the IR into the genes rps19 and ycf1 were identified ( Fig. 1): a small part of the 5′ -end of rps19 is in the IRb region and the 5′ -end of ycf1 extended into the IRa region, resulting in its pseudogenization due to the incomplete duplication.
Phylogeny construction. Among the cp genome sequences, protein-coding regions, LSC, SSC, IR, and introns and spacers, introns and spacers had the highest percentage variation at 1.7%, followed by LSC at 1.3%. The IR regions were least variable at 0.1%. The cp genome, SSC, and coding region, were 0.9%, 0.7% and 0.6%, respectively. Different methods of reconstructing phylogenies did not influence the topologic structure ( Fig. 3) except with SSC, where maximum likelihood and Bayesian inference reconstructions differed in the position of I. szechwanensis from the one built by maximum parsimony (Fig. 4). However, in other respects all the phylogenies were the same, with I. latifolia, the new species, and I. delavayi forming one clade and the other species forming another. The cp genome and LSC phylogenies had higher bootstrap values and posterior probabilities than the others.

Discussion
Modifications of chloroplast genome composition and gene order have been identified in many species in the asterid subclass Campanulidae, which includes the Aquifoliales, Asterales, Escalloniales, Bruniales, Apiales, Paracryphiales and Dipsacales. In this study, gene ycf68 was found in Ilex pubescens, but not Helwingia himalaica.
In addition, variability in the extent of the inverted repeat (IR) regions has been found, with the boundaries between IR and LSC or SSC very fluid. Gene rps19 is nearest to the LSC-IR boundary: in some species, like I. pubescens, H. himalaica, and Panax ginseng 21 , it spans the boundary, in others, like Millettia pinnata 25 and Lupinus luteus 26 , it does not extend into the IR, while in others, like Phaseolus vulgaris 27 , Vigna radiata 28 and Vigna unguiculata (JQ755301 in GenBank), the whole gene is inside the IR. Gene ycf1, nearest to the SSC-IR boundary, is similar.
I. pubescens had fewer and smaller dispersed repeats than reported for some other campanulids. The largest was 29 bp, while in the Apiales 9-29 repeats > 30 bp were recorded in various species, with the largest 79 bp in length 23 .
Variable plastid regions have been used to design markers to investigate phylogenetic relationships, for instance rbcL, matK, and atpB, which have been widely used in phylogenic reconstruction from the genus level upwards. However, these genes are most divergent and informative among distantly related species, and are not suited for studying relationships between species in the same genus, like Curcuma 29 and some genera in the Lauraceae 30 . According to the alignment of the cp genome of seven Ilex species studied here, rbcL, matK, and atpB were not appropriate for studies within Ilex because their divergences, which were 0.2%, 0.16% and 0.07%, respectively, were too low. However, 11 divergent regions were identified with sequence divergences around 4%.
Divergent hotspot regions in the chloroplasts are particularly useful for species-level identification in Ilex, which has many similar species represented by few collections.   wugonshanensis C. J. Tseng ex S. K. Chen & Y. X. Feng, are not clearly distinct in morphology and distribution from their nearest relatives, and may not deserve species status. A study of randomly sampled herbarium specimens in the National Herbarium in Beijing found that, although the DNA was usually highly degraded and most fragment < 300 bp, it was still possible to extract usable genetic material from around a third of specimens 18 . This suggests that similar techniques could be used to clarify the diversity and status of the rare and little-known Ilex species in China. For the phylogenetic reconstructions presented here, most informative characters occurred in intergenic regions, with some of these identified as divergent hotspot regions, as was also shown in Camellia 16 and the Bambusoideae 31 . Among the phylogenies built by six different subsets of the genomic data, trees based on the complete cp genome and LSC displayed the highest support, although most nodes had high support in all. The topologies of different phylogenies were very similar, as also shown in Camellia and the Bambusoideae. These studies also showed that the methods used to build the phylogeny (MP, ML or BI) had a relatively minor influence.
The results of our phylogenetic analyses agree in part with the traditional classification system used in the Flora of China 2 . I. delavayi and I. latifolia in section Aquifolium form one clade with a new, large-fruited species, which is similar to but distinct from I. latifolia (Tan et al., submitted). However, in the other clade, the evergreen species I. pubescens and I. wilsonii are in section Pseudoaquifolium, while the deciduous I. polyneura is in Micrococca. In the classification used in the FOC all the deciduous species form a separate clade. Data from nuclear genes will be needed to resolve these differences, as shown previously for Ilex by Manen 1 . The chloroplast genome is also expected to be useful in helping to resolve the deeper branches of the phylogeny as more whole-genome sequences become available.

Conclusions
The chloroplast genomes of seven Ilex species are reported for the first time in this study and their organization is described and compared with that of other campanulids. Eleven divergent regions were identified, which can be used to develop phylogenetic markers. Phylogenies were constructed using the entire genomes and various subsets and their topologies and resolutions compared. Our results will be useful for identification at the species level and for helping to resolve the deeper branches of the phylogeny. Chloroplast genome sequencing and assembly. About 100 mg of fresh leaf material of each species was used to extract total DNA by a modified CTAB method 17,32 , with 4% CTAB with 0.2% DL-dithiothreitol (DTT) replacing 2% CTAB, and adding approximately 1% polyvinyl polypyrrolidone (PVP) while milling the materials. Long-range PCR was used for DNA amplification of the plastome using nine universal primers developed by Yang 17  Raw reads were filtered by quality control software NGSQCToolkit v2.3.3 33 to obtain high quality Illumina data (cut-off value for percentage of read length = 80, cut-off value for PHRED quality score = 30) and vector-and adaptor-free reads. Filtered reads were then assembled into contigs in the software CLC Genomics Workbench 8  (http://www.clcbio.com), by de novo method using a k-mer of 63 and a minimum contig length of 1 kb. Outputted contigs were aligned with a reference Camellia yunnanensis chloroplast genome (Genbank accession number KF156838), which was the most similar genome identified via BLAST (http://blast.ncbi.nlm.nih.gov/), and ordered according to the reference genome. Contigs were aligned with the reference genome for assembly of the chloroplast genome of each species in Geneious 4.8 34 . Lastly, junctions between LSC/IRs and SSC/IRs were validated by Sanger sequencing of PCR-based products using newly designed primers (see Supplementary Table S2).
Genome annotation and repeat analysis. Assembled genomes were annotated using the Dual Organellar GenoMe Annotator (DOGMA) database 35 , then manually edited for start and stop codons. All annotated cp genomes will be deposited in GenBank. Genome maps were drawn in OGDraw 1.2 36 . Multiple sequence alignment was done with MAFFT 5 37 and manually edited where necessary. A comparative plot of full alignment with annotation of these eight genomes was produced by mVISTA 38,39 , using Helwingia himalaica as a reference. REPuter was used to detect and assess repeats 40 in I. pubescens. Average genetic divergences of these eight species were estimated using p-distances. The genetic divergence between the two most closely related species, I. latifolia and I. delavayi, was also estimated. Molecular marker identification. In order to explore the divergence of chloroplast genes in Ilex and its utilization in identification, all coding genes, introns and spacers were extracted. Every homologous region was aligned by MUSCLE 41 and manually edited where necessary.
Phylogenetic analyses. Sequences of the seven Ilex species and Helwingia himalaica were aligned using MAFFT 37 and manually edited where necessary. Unambiguously aligned DNA sequences were used for phylogeny construction. Phylogenies were constructed by maximum parsimony (MP), maximum likelihood (ML) and Bayesian Inference analyses (BI) using the entire cp genome and also using exons of protein-coding regions, introns and spacers, LSC, SSC, and IR. Lengths of all alignment matrices of these datasets are shown in Supplementary Table S3. In all phylogenetic analyses, Helwingia himalaica was used as outgroup. MP and ML analyses were conducted in PAUP 4.0b10 42 . For MP analysis, heuristic searches were conducted with tree bisection-reconnection (TBR) branch swapping, with the 'Multrees' option in effect. Bootstrap analysis was conducted with 1,000 replicates with TBR branch swapping. For ML analysis, the best substitution model was tested according to the Akaike information criterion (AIC) by jModeltest version 2 43,44 (see Supplementary  Table S3). BI analysis was conducted using MrBayes version 3.2.2 45 and the best substitution model tested by AIC. Two independent Markov Chain Monte Carlo chains were calculated simultaneously for 10,000,000 generations and sampled every 1,000 generations. Potential Scale Reduction Factor (PSRF) values were used to determine convergence in Bayesian Inference using MrBayes version 3.2.2 45 . All PSRF values were 1, indicating that these analyses converged. The first 25% of calculated trees was discarded as burn-in and a consensus tree constructed using the remaining trees.   Table 4. Sampled species and their voucher specimens used in this study according to the taxonomic treatment in the Flora of China.