The complete chloroplast genome sequence of tung tree (Vernicia fordii): Organization and phylogenetic relationships with other angiosperms

Tung tree (Vernicia fordii) is an economically important tree widely cultivated for industrial oil production in China. To better understand the molecular basis of tung tree chloroplasts, we sequenced and characterized its genome using PacBio RS II sequencing platforms. The chloroplast genome was sequenced with 161,528 bp in length, composed with one pair of inverted repeats (IRs) of 26,819 bp, which were separated by one small single copy (SSC; 18,758 bp) and one large single copy (LSC; 89,132 bp). The genome contains 114 genes, coding for 81 protein, four ribosomal RNAs and 29 transfer RNAs. An expansion with integration of an additional rps19 gene in the IR regions was identified. Compared to the chloroplast genome of Jatropha curcas, a species from the same family, the tung tree chloroplast genome is distinct with 85 single nucleotide polymorphisms (SNPs) and 82 indels. Phylogenetic analysis suggests that V. fordii is a sister species with J. curcas within the Eurosids I. The nucleotide sequence provides vital molecular information for understanding the biology of this important oil tree.

length 12,13 . Most of the cp genomes contain 110-130 distinct genes, approximately 80 genes coding for proteins involved in gene expression or photosynthesis 10,14 and other genes coding for four rRNAs and 30 tRNAs 15,16 . In addition, most cp genomes have four distinct regions, including a pair of inverted repeats (IRs, 20-28 kb), which are separated by a small single copy (SSC, 16-27 kb) region and a large single copy (LSC, 80-90 kb) region 14,17,18 . The cp genome can be used to investigate molecular evolution and phylogenies 14,19 . Moreover, cp genomes are maternally inherited, which is beneficial in genetic engineering due to lack of cross-recombination 20,21 .
In this study, we determined the complete sequence of the chloroplast genome of tung tree using the PacBio RS II platform. Additionally, we compared it with other known cp genomes aiming to determine phylogenetic relationships among angiosperms.

Results
Genome sequencing, assembly and validation. Using the third-generation sequencing (PacBio RS II System), 18.26 Gb of raw sequence data was generated from tung tree cp genome through 2,910,237 reads with a mean read length of 6,273 bp. The sequence data that satisfied the quality control standards after filtering, were used to construct the cp genome by comparing with the reference cp genomes of other 908 species in NCBI plastid database. The longest recovered subread was 35,889 bp in length and the total amount of recovered subreads was 334 Mb. The depth of average genome coverage of the subreads exceeded 2000X, suggesting that the sequencing data was sufficient to meet the assembly requirements for cp genome. Finally, we obtained 2.4 M high quality reads with a mean read length of 6,762 bp and an N50 contig size of 17,719 bp. The results showed a high consensus of the sequences except 10 different bases between IRa and IRb regions. To ensure the accuracy for the tung tree cp genome, we compared the Sanger results with the assembled genome. The sequence of tung tree cp genome has been deposited in public databases (Genbank accession number: KY628420).
General features of tung tree cp genome. The total length of tung tree cp genome was determined to be 161,528 bp with the circular quadripartite structure similar to major angiosperms cp genomes. The cp genome contains a small single-copy (SSC) region of 18,758 bp and a large single-copy (LSC) region of 89,132 bp, separated by two copies of an inverted repeat (IR) of 26,819 bp ( Fig. 1, Table 1). The genome is structured with 114 unique genes including 81 distinct protein-coding genes, four distinct rRNA genes and 29 distinct tRNA genes ( Table 2). Seven tRNA genes and all of the rRNA genes are duplicated in the IR regions, making a total number of 135 genes (Tables 1 and 2). The genes coding for proteins, rRNA, tRNA, introns, and intergenic spacers (IGSs) are 82,034, 9048, 2742, 17,821, and 52,599 bp, which represent 50.79, 5.60, 1.70, 11.03, and 32.56% of the cp genome, respectively (Table 1). In this cp genome, 16 genes including 5 tRNA genes contain introns structure (Table 2).
Comparison to the cp genomes from other Euphorbiaceae species. The size of the tung tree cp genome was found to be similar to those from Euphorbiaceae family species, J. curcas, H. brasiliensis and M. esculenta (Table 3). However, tung tree cp genome has the longest SSC region (18,758 bp), whereas J. curcas has the shortest SSC region (17,852 bp). Tung tree cp genome contains more genes (135) than other species, and among them 21 genes duplicated in IRs, while 16 genes duplicated in M. esculenta. As shown in Table 3, tung tree has the highest GC content (36.02%), while J. curcas has the lowest GC content (35.36%). Four conserved rRNAs were identified in every species. J. curcas and H. brasiliensis contain 78 coding genes, whereas M. esculenta has 79, and tung tree has 81 coding genes. Tung tree cp genome encodes 29 types of tRNAs, whereas H. brasiliensis and M. esculenta encode 30 (Table 3).
Repeat sequence analysis. The tung tree cp genome encloses 49 repeats with at least 21 base pairs (bp) per repeat unit (Table 4). These repeats include two complementary repeats, 21 direct (forward) repeats, 16 inverted (palindrome) repeats and 10 reverse repeats have 1,504 bp in length, which is about 0.93% of the genome. Most of these repeats are located in intergenic regions, while 8 repeats are located in the introns and in the protein-coding genes including psaA, ycf1 and ycf2.
Simple sequence repeat (SSR) analysis. 81 SSR loci were identified, including 63 mononucleotide SSR loci (77.78%), five dinucleotide SSR loci (6.17%), and 13 other types of SSR loci (16.05%). Among them, there are 79 A or T repeats, one G repeat and one AG dinucleotide repeat. These SSR loci represent 0.937% of the complete cp sequence. 64 of the 81 SSR loci are located in intergenic regions, eight in gene-coding regions, six in intron regions, two between the gene-coding and intron regions, and only one between the intergenic regions and gene codingregions (see Supplementary Table S1). from 274 bp from the IRb/LSC border in tung tree cp genome (Fig. 3). The trnH-GIG sequences are found in LSC regions of all cp genomes. This gene is 209 bp from the IRa/LSC border in tung tree cp genome. The rps19 sequence is detected in the IR regions of tung tree cp genome and 8 bp apart from the LSC/IRb border, whereas this gene is located in the LSC in J. curcas, B. sinica and N. tabacum. In addition, the rps19 gene is observed at the IRb/LSC border of two Euphorbiaceae plants, M. esculenta and H. brasiliensis (Fig. 3).
Phylogenetic Analysis. To analyze the V. fordii phylogenetic position within asterid lineage, we aligned 55 complete cp genome sequences using the 36 protein-coding genes. The species representing 24 orders and included 3 outgroup taxa. The sequence analysis showed a fully resolved phylogenetic tree (12,995 in length of 0.51 for consistency index and 0.65 for retention index) (Fig. 4). The phylogenetic trees generated by ML and MP alignment have similar topologies (Figs 4 and S1). There are a total of 7,609 positions in the final dataset. V. fordii is placed as sister to J. curcas with a bootstrap (96). V. fordii is grouped to Malpighiales with J. curcas. There is a sister relationship among Falales, Cucurbitalesand Rosales.

Discussion
The entire chloroplast genome of tung tree was determined using the third-generation sequencing (PacBio RS II System) method and assembled with the chloroplast genomes of the other Euphorbiaceae plants using the cp genomes of J. curcas and M. esculenta as references. The genome sequence was confirmed by Sanger sequencing of PCR-based products using specific primers (see Supplementary Table S2). As shown in Figure 1, the tung tree cp genome is a typical circle DNA, similar to those from Euphorbiaceae 7, 13, 24 .
Repeat sequences are useful for studying genome rearrangement and play an important role in phylogenetic analysis 25 . There are 49 repeats in the tung tree cp genome. A large number of repeats are distributed within IGS regions and the IRs account for the majority of repeats. In addition, we also find many repeats are present in the ycf2 gene including two forward repeats and four palindrome repeats. The results are similar to those of previous studies on Jatropha curcas 13 , Citrus sinensis 16 and Vitis 26 . Meanwhile, the non-coding regions in cp genomes are important for phylogenetic studies in angiosperms 27 . Most of the repeats are found in the non-coding regions of the tung tree cp genome.
In tung tree cp genome, 81 SSR loci with a length of at least 10 bp were identified (Table S1). All of the dinucleotides are composed of multiple copies of AT/TA repeats, and 75 of them are detected in the noncoding regions. These findings are similar to those of the other published results, i,e., repeats are typically found in the noncoding regions, especially in IGS regions of the cp genomes 17,28,29 . The SSRs in cp genomes was first reported in Pinus radiata 30 . These SSRs can be useful biomarkers for genetic diversity.
The border regions of LSC-IRa, IRa-SSC, SSC-IRb and IRb-LSC represent highly variable regions with many nucleotide changes in cp genomes of closely related species. We compared the IR boundary regions of cp genome from six species in this study. The border of tung tree cp genome is differed slightly from that of other cp genomes. At the IRb and SSC border, the intergenic region of ycf1 and ndhF in tung tree cp genome is larger (274 bp) than those in other species 13 . In addition, the SSC region in tung tree cp genome is also larger than those in other species. The long distance of IRb and SSC border could be a result of the expanding chloroplast genome of tung tree. The rps19 gene of tung tree is entirely located in the IR regions, which is generally located in the LSC region or at the junction of LSC/IRb border in dicotyledons [31][32][33] . Previous studies have shown that rps19 sequence is generally positioned in the IR regions of cp genomes from monocotyledon pineapple (Ananas comosus) 34 , and Chionographis japonica 35 . Our results indicate that the rps19 gene location is similar to monocotyledon. In Euphorbiaceae, though the IR region of tung tree cp genome is shorter than that of J. curcas and M. esculenta, it has more duplicated genes (21 genes) than those of J. curcas (17 genes) and M. esculenta (16 genes). The main reasons for these differences are that the rps19 gene is duplicated in IR regions and that the ycf15 and ycf68 genes are found in tung tree; which are consistent with those results obtained from Hevea brasiliensis 7 and Musa acuminata 36 . Meanwhile, ycf15 and ycf68 genes were identified as pseudogenes in tung tree, and ycf68 sequence is found in the intron regions of trnI-GAU. The similar result has been reported in the cp genome sequence of Pelargonium hortorum 15 . It is reported that cp genomes in most land plants have two identical IR regions, which have lower the nucleotide substitution rates and fewer indels than LSC and SSC regions 37 . Similarly, few indels were identified in the IR regions of tung tree cp genome. IGSs and intron regions have more indels than protein-coding genes and thus evolve more quickly than protein-coding genes. Traditionally the nucleotide substitutions and indels in cp genomes have been used as DNA markers in the phylogenetic analysis of many land plants [38][39][40] .
In the Euphorbiaceae family several studies have analyzed the phylogenetic relationship based on chloroplast DNA sequences 7,13,24 . The phylogenetic evolution of V. fordii were studied here using 36 protein coding genes for 55 plant taxa (Supplementary Table S3), including 52 angiosperms and three outgroup gymnosperms (Ginkgo, Larix and Pinus). We used MP and ML analyses to construct an evolutionary tree involving 55 amino acid sequences. All 52 nodes were resolved well and reliable based on MP bootstrap value: 41 have strong bootstrap support of 95-100% and 11 have moderate support of 60-95%. V. fordii and the other four species in the family Euphorbiaceae are clustered into Malpighiales as a well-supported monophyly and placed within Eurosids  rps2, rps3, rps4, rps7 a , rps8, rps11, rps12 a,b , rps14, rps15, rps16   I, which is similar to pervious work 41 . The phylegenetic tree indicates that subfamily Crotonoideae is a younger, more evolved group than subfamily Acalyphoideae (i.e. Ricinus in this study). However, the deep phylogeny within angiosperms differ from previous research in several ways 42,43 . In our analysis, monocots forms a sister group to the remaining angiosperms, although it is often embedded in dicots in other studies. One possible No.  Table 4. Repeat sequences in the tung tree cp genome. C: complement repeats, F: forward repeats, P: palindrome repeats, R: reverse repeats.

Length (bp)
reason is the heterogeneity between the nuclear and chloroplast genomes 44,45 . There are a few disparities between the MP and ML trees in our analyses. This might be because maximum parsimony is sensitive to incongruent evolutionary rates at internal nodes 46 . In addition, V. fordii is suggested to be more closely related to Jatropha than to Hevea and Manihot.

Conclusion
We presented the first complete nucleotide sequence of tung tree cp genome using PacBio RS II sequencing platforms. The tung tree cp genome (161,528 bp) was fully characterized and compared to the cp genomes of related species. We identified two inverted repeat regions and one small and one large single copy regions. The tung tree cp genome contained 114 unique genes coded for 81 proteins, four ribosomal RNAs and 29 transfer RNAs. Phylogenetic analysis suggests that V. fordii is a sister species of J. curcas within the Eurosids I. Our study provides vital molecular information for understanding of the cp genome of this commercially important woody oil tree.

Material and Methods
Plant materials and DNA sequencing. Tung   Genome assembly and annotation. All sequenced reads were filtered through removing the adapter sequence and cutting off low quality bases in reads and assembled by HGAP 2.3.0 process 48 , Celera assembler (CA) assembled software 49 and OLC assembly algorithm 50 . The cp genome was annotated using Dual Organellar GenoMe Annotator (DOGMA) 51  in NCBI plastid database. The sequence discrepancies between tung tree and Jatropha or Manihot cp genome sequences were validated by PCR amplification and Sanger sequencing. Ten different bases between IRa and IRb regions were also amplified by PCR. PCR were used to verify differences in the sequence of the preliminary cp genome assembly using 29 pairs of forward and reverse primers (see Supplementary Table S2).
Analysis of cp genome sequence. GenomeVx software 53 was used to draw the circular map of the tung tree chloroplast genome. Mauve software 12 and mVISTA program were applied to identify similarities among different cp genomes (http://genome.lbl.gov/vista/mvista/submit.shtml) 54 . REPuter 55 was utilized to identify forward (direct) repeats, reverse sequences, complementary and palindromic sequences with at least 21 bp in length and 90% of sequence identity. The distributions of simple sequence repeats (SSRs) were predicted using the microsatellite search tool MISA 56 . Insertions and deletions (indels), as well as nucleotide substitutions and inversions were scored as single independent characters. The formula (NS + ID)/L × 100 (NS, nucleotide substitutions number; ID, indels number; L, the aligned sequence length) was used to calculate the ratio of mutation events. In addition, the contraction/expansion regions of the inverted repeat (IR) were compared among V. fordii, J. curcas, M. esculenta, H. brasiliensis, B. sinica, and N. tabacum.