Gene losses and partial deletion of small single-copy regions of the chloroplast genomes of two hemiparasitic Taxillus species

Numerous variations are known to occur in the chloroplast genomes of parasitic plants. We determined the complete chloroplast genome sequences of two hemiparasitic species, Taxillus chinensis and T. sutchuenensis, using Illumina and PacBio sequencing technologies. These species are the first members of the family Loranthaceae to be sequenced. The complete chloroplast genomes of T. chinensis and T. sutchuenensis comprise circular 121,363 and 122,562 bp-long molecules with quadripartite structures, respectively. Compared with the chloroplast genomes of Nicotiana tabacum and Osyris alba, all ndh genes as well as three ribosomal protein genes, seven tRNA genes, four ycf genes, and the infA gene of these two species have been lost. The results of the maximum likelihood and neighbor-joining phylogenetic trees strongly support the theory that Loranthaceae and Viscaceae are monophyletic clades. This research reveals the effect of a parasitic lifestyle on the chloroplast structure and genome content of T. chinensis and T. sutchuenensis, and enhances our understanding of the discrepancies in terms of assembly results between Illumina and PacBio.


Chloroplast Genome Structures of T. chinensis and T. sutchuenensis.
show that the chloroplast genome sequence of T. chinensis is a circular molecule that is 121,363 bp in length, which can be divided into a large single-copy (LSC) region of 70,357 bp and a small single-copy (SSC) region of 6,082 bp, and separated by a pair of inverted repeats (IRa and IRb) each 22,462 bp in length (Fig. 1). This sequence, which was assembled using the reads obtained by the Illumina sequencing platform, is 121,363 bp in length. By contrast, the sequence assembled using the reads obtained by the PacBio system is 12 bp shorter than that assembled from the reads obtained by the Illumina platform. After verification using PCR, we found that the complete chloroplast genome of T. chinensis is consistent with the assembly results obtained using the reads from second-generation sequencing. The chloroplast genome of T. sutchuenensis is extremely similar to that of T. chinensis in size and genomic structure; it is 122,562 bp in length and retains a typical structure comprising a LSC (70,630 bp), a SSC (6,102 bp), and two IRs, each having 22,915 bp (Fig. 2). The complete and correct chloroplast genome sequences of T. chinensis and T. sutchuenensis were deposited in GenBank under accession numbers KY996492 and KY996493, respectively.
Data reveal that both species have a GC content of 37.3%, which is unevenly distributed across the whole chloroplast genome. In both cases, the GC content of the IR regions exhibits the highest values across the complete chloroplast genome, 43.0% in T. chinensis and 42.8% in T. sutchuenensis, respectively. This high GC content in IR regions is the result of four rRNA genes (rrn16, rrn23, rrn4. 5 and rrn5) that occur in this region 42 . In addition, after the LSC, which has a GC content of 34.7%, lowest values of 26.2% are seen in SSC regions.
A total of 106 genes were identified in each genome, which include 66 protein-coding genes, 28 tRNAs, 8 rRNAs, and 4 pseudogenes. Simultaneously, we compared these two Taxillus species with autotrophic plants, including Nicotiana tabacum and Osyris alba. Genes encoding subunits of the NAD(P)H dehydrogenase complex (ndh genes) were missing from the chloroplast genome of the two species, whereas three genes for ribosomal proteins (rpl32, rps15, and rps16), seven tRNA genes (trnA-UGC, trnG-UCC, trnH-GUG, trnL-GAU, trnK-UUU, trnL-UAA, and trnV-UAC), four ycf genes (ycf1, ycf5, ycf9, and ycf10), and initiation factor gene (infA) were also lost (Table S1). Two ribosomal protein genes (rpl16 and rpl2) and the duplicate gene ycf15 have also been pseudogenized because their gene-coding regions are interrupted by deletion, insertions or internal stop codons, while the pseudogene rpl2 is located in IRb region. We designed the primers to perform PCR to verify the accuracy of the pseudogenes rpl16, ycf15 and rpl2. The primer sequences are listed in Supplementary Table S2.
The basic information and gene contents of the chloroplast genomes of T. chinensis and T. sutchuenensis compared to other five species are presented in Table 1 and Supplementary Table S1.
Introns play an important role in the regulation of gene expression. Introns enhance exogenous gene expression at specific sites within plants at particular times, resulting in desirable agronomic traits 43 . Introns within these two species are similar to other angiosperms 1,44,45 . Results reveal the presence of nine genes containing introns in each chloroplast genome, including atpF, rpoC1, ycf3, rps12, rpl2, ψrpl16, clpP, petB, and petD. In addition, the SCIEnTIfIC REPORtS | 7: 12834 | DOI:10.1038/s41598-017-13401-4 ycf3 gene and rps12 gene each contain two introns and three exons. The ycf3 gene is located within the LSC, as seen in Metasequoia glyptostroboides 45 , Aquilaria sinensis 46 , while the rps12 gene is specialized for trans-splicing. The 5′ exon is located in the LSC, and the 3′ exon is located in the IR, as is the case in Panax ginseng 44 , C. deserticola 1 , and L. squamaria 23 . Relevant lengths of exons and introns are listed in Table 2.
Comparative genome analyses. Data plotted using mVISTA (Fig. S1) reveal that non-coding regions of the chloroplast genomes of the two Taxillus species are more divergent than their coding counterparts. Moreover, the two IR regions have lower sequence divergence than the LSC and SSC regions. Similar results were obtained in previous research on the complete chloroplast genomes of five Lamiales species 42 as well as in a comparative study of five Epimedium chloroplast genomes 47 . In the present study, rpl16 gene is the most divergent of the coding regions, probably because of pseudogenization. Thus, we conducted a series of linear rearrangement comparisons across the complete chloroplast genome sequences of six species (T. chinensis, T. sutchuenensis, S. jasminodora, V. minimum, O. alba and N. tabacum) aligned in Geneious using the Mauve algorithm (Fig. 3). The comparisons reveal the presence of two structural variants, including an approximately 24-kb-long inversion within the LSC region of the V. minimum chloroplast genome and an approximately 3-kb-long inversion in the SSC region of the O. alba chloroplast genome, which is consistent with a previous report 24 . The lengths of the IR regions in our two species are also similar to that of other plants, with the exception of S. jasminodora 48 where they are much shorter (at least 10 kb) than the length of the IR regions of the five species considered here, including T. chinensis and T. sutchuenensis. Codon Usage. The calculations for the codon usage of protein-coding genes within T. chinensis and T. sutchuenensis chloroplast genomes are summarized in Fig. 4 and Supplementary Table S3. Results reveal the presence of 63 codons encoding 20 amino acids within the chloroplast protein-coding genes of these two species; of these,     Table S3). The data presented in Fig. 4 illustrates that the RSCU value increases with the quantity of codons that code for a specific amino acid. High codon preference, especially a strong AT bias in codon usage, is very common in other land plant chloroplast genomes 42,44 . The present results are similar to the chloroplast genomes of A. sinensis 46 and species within the genus Ulmus 49 in terms of codon usage.
Simple Sequence Repeats (SSRs) Analyses. SSRs are ubiquitous throughout genomes and are also known as microsatellites. SSRs comprise tandem repeated DNA sequences that consist of between one and six repeat nucleotide units 50 . As such, SSRs are widely used as molecular markers in species identification, population genetics, and phylogenetic investigations because they exhibit high levels of polymorphism [51][52][53] . In total, 195 and 198 SSRs are identified within the chloroplast genomes of T. chinensis and T. sutchuenensis, respectively (Table 3; Supplementary Tables S4-S5), which mainly comprise mononucleotide repeats encountered 146 and 139 times in each case. In addition, A/T mononucleotide repeats (93.9% and 96.4%, respectively; Table 3) are the most common, while the majority of dinucleotide repeat sequences comprise AT/TA repeats (59.5% and 67.3%, respectively; Table 3). Results show that SSRs within the chloroplast genomes of T. chinensis and T. sutchuenensis are dominated by AT-rich repetitive motifs, which is consistent with the fact that AT content is also very high (62.7%) in these species. This result is also in agreement with previous studies showing higher proportions of polyadenine (polyA) and polythymine (polyT) relative to polycytosine (polyC) and polyguanine (polyG) within the chloroplast SSRs in many plants 6 .
Phylogenetic Analyses. Phylogenetic trees were constructed using two methods based on two datasets from different species (Fig. 5). Results revealed extremely similar tree topologies from each dataset irrespective of the method used, as supported by high bootstrap values. All nodes in our maximum likelihood (ML) and neighbor-joining (NJ) trees based on 54 protein-coding genes have 100% bootstrap support values, whereas four out of six nodes that received bootstrap values of ≥99% were recovered in both sets of trees when matK genes were used for analyses. All nodes in all phylogenetic trees received higher than 50% bootstrap support. All four phylogenetic trees showed that T. chinensis and T. sutchuenensis are sister taxa with respect to S. jasminodora (Olacaceae), whereas the three species within genus Viscum group with Osyris alba (Santalaceae) and all Santalales species are clustered within a lineage distinct from the outgroup.

Discussion
Numerous variations occur in the chloroplast genomes of parasitic plants. To date, however, most investigations on these genomes in parasitic and heterotrophic plants focused on nonphotosynthetic species 24 . For instance, some complete chloroplast genomes of holoparasitic plants from Orobanchaceae were reported 1,21,22 . A small number of hemiparasitic plants within Santalales and other groups have been studied 24,48 . In this study, the complete chloroplast genomes of T. chinensis and T. sutchuenensis from Santalales were assembled, annotated, and analyzed.  Table 2. Genes with introns in the chloroplast genomes of T. chinensis and T. sutchuenensis as well as the lengths of the exons and introns.    www.nature.com/scientificreports/ Gene loss events have occurred within the chloroplast genomes of most parasitic plants and in a handful of autotrophic species 21,54 . Previous work has shown that genes of chlB, chlL, chlN, and trnP-GGG have been lost from the chloroplast genomes of most flowering plants 55 , whereas gene infA, which codes for a translation initiation factor, is either missing or has been transferred in many plants 56 , including the two species observed in this study. All ndh genes have been lost from the chloroplast genomes of T. chinensis and T. sutchuenensis, similar to the case of Cuscuta gronovii and C. obtusiflora 18,19 . Similarly, nine out of eleven ndh genes have been pseudogenized within the chloroplast genome of L. squamaria 23 . This degree of ndh gene degradation is not only observed in heterotrophic organisms, but also in many autotrophic plants, including Orchidaceae 57,58 , Geraniaceae 59 and Cactaceae 60 . Kim et al. reported that losses of ndh genes in angiosperms are usually associated with nutritional status and/or extensive rearrangements of chloroplast structures 57 . Also, ndh genes loss events and pseudogenization have occurred in reported chloroplast genomes of parasitic plants, regardless of the degree of degradation in photosynthetic capacity 1,23,24,48 . As a result, studies suggested that ndh genes were first lost in the transformation from autotrophy to heterotrophy 18,61 .
In this study, seven transfer RNA genes, including trnK-UUU, have been lost from the chloroplast genomes of both species. Although similar tRNA losses have commonly occurred in most plants (Supplementary Table S1), the trnK-UUU gene, which is generally absent from most parasitic plants, is completely preserved (including its intron matK gene) within the chloroplast genome of Cistanche deserticola 1 . Li et al. suggested that tRNA genes from the chloroplast genome were lost later than photosynthesis genes 1 .
A pseudogene, which is a defective copy of the functional gene, is widespread in the chloroplast genome of plants and has lost the normal protein coding function 1,18,19,62 . Loss of genic normal activity is generally caused by mutations inhibiting gene expression. Pseudogenes not only demonstrate gene mutation accumulation but are also associated with gene expression and regulation 63 . Four pseudogenes exist in the chloroplast genomes of T. chinensis and T. sutchuenensis; these pseudogenes include rpl16, rpl2 and ycf15 (duplicate gene). The gene of ycf15 has been pseudogenized in many plants, including S. jasminodora 48 , C. reflexa 19 and C. exaltata 18 . Genes of rpl16 and rpl2 exist in most plants as functional genes, whereas they have been pseudogenized in the current study.
A previous study pointed out that one early response of the chloroplast genomes to the evolution of a parasitic lifestyle was condensation via losses in numerous non-coding and unimportant regions; this event resulted in reduction of chloroplast genome size 19 . Although gene loss can be regarded as a terminal evolutionary step, accumulation of point mutations leading to pseudogenization nevertheless occurred at previous steps 24 . Second-generation sequencing technology provides an efficient, novel, and rapid method for whole-genome sequencing 12,64,65 . SMRT sequencing, which is combined with circular consensus sequencing (CCS), provides multiple reads of individual templates 40 . Wu et al. 66 have compared three generations of sequencing technologies (Sanger, Illumina and PacBio) on chloroplast genome assembly. Results demonstrated that long reads from PacBio showed potential for highly accurate "finished" genomes. However, the accuracy between second-generation and third-generation sequencing platforms was not compared thoroughly. In the present study, the complete chloroplast genome sequence of T. chinensis was sequenced using Illumina and PacBio platforms. Discrepancies in terms of assembly results between Illumina and PacBio were detected using PCR-based conventional Sanger sequencing, and the quality is very high. Results revealed that in PacBio platform, the error rate is high in homopolymers when the number of repeat units of a mononucleotide is higher than or equal to six. In the chloroplast genome of T. chinensis, polyA/T and polyC/G (repeat higher than or equal to six) included 509 and 72 sites, respectively. Although A/T mononucleotide repeats are the most common types (Table 3), these errors are mainly present in structures of polyC and polyG (Table 4). Among the 14 errors, 12 were G/C deletions, 1 was A/T deletion, and 1 was A/T insertion. All errors differed in terms of only one base.
As a result of multiple comparisons (Table 1 and Fig. 3), we observed that complete lengths of the chloroplast genomes of T. chinensis and T. sutchuenensis are similar to those of S. jasminodora and V. minimum, whereas the lengths of SSC regions are much smaller (at least 3 kb). These regions, which contain most ndh genes, also encapsulate the largest variation within the chloroplast genome 67 and have undergone dramatic reductions in some parasitic plants, including L. clandestine 68 . Previous studies have demonstrated that positions of IR junction and SSC region are correlated with degeneration of ndhF and ycf1 genes 57,69 . Loss of ycf1 and all ndh genes (including ndhF), as revealed by this study may explain why SSC chloroplast genome regions of the two considered species are shorter than those of others.
Chloroplast genomes have provided significant data for evolutionary, taxonomic, and phylogenetic studies 46 . Specifically, the chloroplast gene of matK has been widely utilized in plant phylogenetic analyses 70,71 . In this study, we constructed phylogenetic trees using ML and NJ methods based on matK and 54 protein-coding genes commonly present in the chloroplast genomes of ten species, including two medicinal hemiparasites in the current study. Phylogenetic results are extremely consistent, irrespective of method and dataset. All phylogenetic results strongly support the theory that Loranthaceae and Viscaceae diverged independently from one another. Phylogenetic results discussed in the present study are broadly consistent with those of a previous research, which utilized chloroplast trnL intron sequences to investigate inter-familial relationships within Santalales 30 .

Conclusions
The complete chloroplast genome sequences of traditional medicinal hemiparasites T. chinensis and T. sutchuenensis were obtained and analyzed. Results of this study revealed effects of parasitic lifestyle on chloroplast structure and genome content in these species and enhanced understanding of phylogenetic positions and relationships of T. chinensis and T. sutchuenensis. This research also showed that sequences assembled using reads obtained by the Illumina platform is more accurate than those from PacBio.   Chloroplast Genome Assembly. Low-quality reads resulting from all samples were trimmed using the software Trimmomatic 72 . The trimmed reads included a mixture of data from nuclear and organelle genomes. We used the chloroplast genome sequence of Viscum minimum, which was downloaded from GenBank to establish a Basic Local Alignment Search Tool (BLASTn) database. Then all trimmed reads were mapped onto this database, and the mapped reads were extracted from raw data based on coverage and similarity. Extracted reads were assembled to contigs using SOAPdenovo2 73 . SSPACE 74 was used to construct the scaffold of the chloroplast genome, and GapCloser 73 was used to fill gaps. Reads sequenced using PacBio system were used to assemble the chloroplast genome according to the strategy described by Xiang et al. 41 . Assembly results obtained using Illumina and PacBio (

Codon Usage and SSRs
Analyses. RSCU value, the ratio between frequency of use and expected frequency of a particular codon, is a simple method for detecting non-uniform synonymous codon usage (SCU) within a coding sequence 82 . In the present study, utilizing the RSCU ratio, we performed statistical analyses to investigate the distribution of codon usage with the software CodonW (http://codonw.sourceforge.net/), applying a 1.00 value for no preference. In addition, a value less than 1.00 refers to a frequency of use that is less than expected, whereas a value higher than 1.00 indicates codons that are more frequently used than expected. Potential SSRs were exploited using the software MISA (http://pgrc.ipk-gatersleben.de/misa/), with parameters set to encompass the number of repeat units of a mononucleotide SSR higher than or equal to eight; followed by higher than or equal to four repeat units for di-and tri-nucleotide SSRs; and higher than or equal to three repeat units for tetra-, penta-and hexa-nucleotides, respectively. In this study, we mainly searched for complete repetitive SSR loci, treating cycled or reverse complementary SSRs as the same type.

Phylogenetic Analyses.
To determine phylogenetic positions of T. chinensis and T. sutchuenensis within Santalales, we analyzed the chloroplast genomes of ten species, encompassing five other taxa within this lineage, V. album (accession number: KT003925), V. crassula (KT070881), V. minimum (KJ512176), O. alba (KT070882), and S. jasminodora (KX775962). We also used the chloroplast genomes of P. ginseng (AY582139), N. tabacum (Z00044), and Astragalus mongholicus (KU666554) as outgroups, and constructed phylogenetic trees using ML and NJ methods in the software MEGA 6.0 79 with 1000 bootstrap replicates employing 54 protein-coding genes commonly present in the ten species and matK genes. ML analysis was conducted based on the Tamura-Nei model using a heuristic search for initial trees. This most appropriate model was determined by Modeltest 3.7 83 . NJ trees were performed with NJ method 84 , and evolutionary distances were computed using the Kimura 2-parameter method 85 .