The implication of plastid transcriptome analysis in petaloid monocotyledons: A case study of Lilium lancifolium (Liliaceae, Liliales)

Transcriptome data provide useful information for studying the evolutionary history of angiosperms. Previously, different genomic events (i.e., duplication, deletion, and pseudogenization) were discovered in the plastid genome of Liliales; however, the effects of these events have not addressed because of the lack of transcriptome data. In this study, we completed the plastid genome (cpDNA) and generated transcriptome data of Lilium lancifolium. Consequently, the cpDNA of L. lancifolium is 152,479 bp in length, which consists of one large single copy (81,888 bp), one small single copy (17,607 bp), and two inverted repeat regions (26,544 bp). The comparative genomic analysis of newly sequenced cpDNA and transcriptome data revealed 90 RNA editing sites of which two positions are located in the rRNA coding region of L. lancifolium. A further check on the secondary structure of rRNA showed that RNA editing causes notable structural changes. Most of the RNA editing contents are C-to-U conversions, which result in nonsynonymous substitutions. Among coding regions, ndh genes have the highest number of RNA editing sites. Our study provided the first profiling of plastid transcriptome analyses in Liliales and fundamental information for further studies on post-transcription in this order as well as other petaloid monocotyledonous species.


Results
plastid genome of Lilium lancifolium. The complete plastid genome sequence of L. lancifolium (accession number MH177880; Fig. 1A) in this study is 152,479 bp in length and composed of a large single copy (LSC; 81,888 bp), a small single copy (SSC; 17,607 bp), and two inverted repeat regions (IR; 26,492 bp). In comparison with the previously completed cpDNA of L. lancifolium from China (accession number KY748297) and Korea (accession number KY940844), of which the length of cpDNA is identical (152,574 bp), the gene contents and orders are similar among the three individuals. However, the percentage of the identity of new cpDNA in L. lancifolium in this study is 99.8% and 99.9% compared to counterparts from Korea and China, respectively. Also, the translation initiation factor IF-1 (infA) gene was not annotated in previous data, but it was predicted as a pseudogene in this study because of the presence of internal stop codons within the coding region. Additionally, the different length of poly A sequence after start codon caused two types of cemA gene. The first type is functional cemA in L. lancifolium from China, of which 9-bp-poly A sequence was found. In contrast, the malfunctioning cemA was annotated in cpDNA from Korea counterparts, of which 10-bp and 11-bp-poly A sequences were found and caused internal stop codons in the coding region. The IGS regions among three cpDNA sequences of L. lancifolium showed a high similarity (over 95%), except the ISG regions of petA-psbJ, ndhF-rpl32, and ccsA-ndhD with similarity of 84.2%, 93.95%, and 94.83%, respectively (Table 1).
RNA editing sites and their potential effects. The mapping results of transcription data to complete cpDNA of L. lancifolium revealed 90 editing sites, which located unequally among genes groups (Fig. 1B). The ndh genes possess the highest number of editing sites (29 sites) followed by the RNA polymerase genes (16 sites). The Rubisco gene (rbcL) has only one RNA editing site within its coding region. In contrast, ndhB gene has the highest number of editing sites (10 sites). The most abundant content of RNA editing in L. lancifolium is C-to-U conversion (Table 2). However, the U-to-C conversion was also found in rpl36 and rrn23. Most RNA editing resulted in nonsynonymous substitution, of which the changes from S (Serine) to L (Leucine) is the most frequent (25 sites), followed by S (serine) to F (Phenylalanyl) with 17 sites. Nevertheless, 12 out of 90 editing sites resulted in synonymous substitution (Table 2). Additionally, a total of 32 editing sites were also found in IGS regions of L. lancifolium cpDNA (Supplementary Table 1). In comparison to the previous complete cpDNA of L. lancifolium, the infA and cemA were annotated as pseudogenes because of the presence of internal stop codons within the coding regions. However, the transcriptome data revealed that there are no significant RNA editing sites within the coding region of these two genes. The RNA editing occurred not only in protein-coding genes but also in rRNA (Table 2). Specifically, the U-to-C conversion was found in rrn23S, whereas the C-to-U conversion was recorded in rrn5S. A further check on the predicted structure of rrn5S showed that the editing event affected the structure of rrn5S (Fig. 2). The RNA expression level was also compared among protein-coding genes of L. lancifolium cpDNA ( Table 3). The results showed that the psbA is the most expressed gene followed by rbcL and petB. Although the ndh genes have the highest number of RNA editing sites, their expression level is lower than other genes ( Table 3). The petL has the lowest expression level.

Discussion
Previously, most of the plastid genome studies used the data of one individual as the representative of that species and focused on a comparative genomic analysis with closely related taxa 15,16,18 . This approach has not been fully providing detailed information on the diversification of cpDNAs within a species. Recently, Shi et al. 20 reported 11 complete cpDNAs of both cultivated and wild watermelon. The cpDNAs of three individuals of each species were completed and compared to others. They showed that although the gene number and order are identical among examined species, the wild watermelon exhibited a significant change in the plastid genome size. In this study, the newly sequenced cpDNA revealed both conserved and diverse trends in comparison with previously published cpDNA of L. lancifolium. In fact, the cpDNA of L. lancifolium in this study is longer (105 bp) than those in previous studies which have an identical length (152,574 bp). Additionally, the Korean L. lancifolium has nonfunctional cemA which was found as a functional gene in Chinese counterpart. These findings suggested the interspecific diversification of plastid genome among wild plants and the need for further studies on this issue. Additionally, www.nature.com/scientificreports www.nature.com/scientificreports/ low similarities in IGS of petA-psbJ, ndhF-rpl32, and ccsA-ndhD suggest these regions as hot-spot sites for further studies on evolution of L. lancifolium and related species (Table 1).

KY940844-Korea
RNA editing plays an important role during the post-transcriptional process because it alters the coding content of the genes by two pathways of insertion/deletion and conversion/substitution. Specifically, the C-to-U conversion altered a serin to phenylalanine codon in psbF mRNA of Spirodela polyrhiza 6 . Also, the formation of translation initiation, or internal stop codon, caused by RNA editing, was reported 9-11 . A similar trend was found in the transcriptome data of L. lancifolium ( Table 2). The changes of amino acid composition among genes were mainly caused by C-to-U conversion. Also, the formation of the start codon in rpl36 and ndhD genes resulted from C-to-U conversion at the second position in the start codon. In L. lancifolium cpDNA, infA and cemA genes were annotated as pseudogenes and expected to be corrected by RNA editing process. However, there are no RNA editing sites in mRNA of these two genes. Previously, the loss of infA in cpDNA was recorded in many plants 21 and compensated by the nuclear infA. In fact, mRNA of nuclear infA was found by assembling transcriptome data to infA gene of Pheonix dactylifera (GenBank Accession XM_008784933). In contrast, the case of cemA needs further investigations. Among the four examined monocots, there are different numbers of RNA editing sites (Supplementary Table 2). Most of RNA editing sites resulted in nonsynonymous substitutions, except for Phalaenopsis aphrodite subp formosana of which more than half substitution is synonymous (Supplementary Table 2). However, most of the editing content is C to U in all examined taxa. Notably, the RNA editing in Deschampsia antarctica revealed the changes from G to A, G to C, A to G, A to C, A to U, and U to G 22 .  www.nature.com/scientificreports www.nature.com/scientificreports/ The transcriptome data revealed different expression level of genes in L. lancifolium cpDNA (Table 3). In comparison with RNA expression in D. antarctica, a member of the grass family, there are differences among expression level between L. lancifolium and D. antarctica 22 . One possible explanation could be the different habitat environment. In fact, D. antarctica adapted to the harsh environment of Antarctica whereas L. lancifolium distributes in temperate regions. Additionally, the gene expression is different during development stages of plants.
In this study, we used only the leaf tissue at the growth stage of L. lancifolium. Therefore, further studies of transcriptome of various tissues at different development stages should be conducted to explore the overall trend of expression in L. lancifolium.
Previously, Chen et al. 7 reported the effect of RNA editing on stabilizing the secondary structure of trnM in Phalaenopsis aphrodite subsp formosana. In this study, RNA editing site resulted in changes in the secondary structure of rrn5S. These data revealed the effect of RNA editing process on the stability of rRNA and tRNA. Although RNA editing sites have been recorded in protein coding regions of plastid genomes, studies on the secondary structure of these changes have not been fully conducted. Therefore, further investigations should be done to give a deeper understanding of this issue.
To sum up, the first plastid genome analysis of L. lancifoliun provided fundamental information for further studies on post-transcription events in Liliales. In fact, the RNA editing process is not able to reverse the pseudogenization caused by genomic events in plastid genomes of L. lancifolium. In this study, only leaf tissue was used for plastid transcriptome study, which does not fully reflect the evolutionary information in L. lancifolium. Therefore, transcriptome data of other tissue should be generated to trace the evolutionary history in L. lancifolium and other species of Liliales.

Materials and Methods
plant materials, total DNA extraction, and RNA isolation. The mature fresh leaves of Lilium lancifolium during the growth stage were collected in its wild habitats. The specimen of L. lancifolium was made and deposited to Gachon University Herbarium. For DNA extraction, the fresh leaves were dried in silica gel before extraction steps with Plant DNeasy Mini Kit (Qiagen, Korea). To isolate total RNA, the fresh leaves were immediately put in liquid nitrogen after collected. Then, they were stored in the cold condition until being used for RNA isolation, which was conducted using Plant RNeasy mini Kit (Qiagen, Korea). Both DNA extraction and RNA isolation were conducted based on manufacturer's instruction. The quality of DNA and RNA were tested using gel electrophoresis and one spectrophotometer. www.nature.com/scientificreports www.nature.com/scientificreports/ NGS generation, genome assembly, RNA editing determination and prediction of rRNA structure. To generate NGS data, the total DNA and RNA from leaves of L. lancifolium were applied to Illumina Hiseq 200 and Nextseq 500, respectively. First of all, the DNA and RNA were fragmented. Then, newly fragmented DNA and RNA were hybridized and ligated with adapters. In the next step, PCR amplification was employed to create the sequence library. Finally, the library was sequenced and resulted in the DNA-NGS data of 301 bp in length and transcriptome data of 76 bp in length. The DNA-NGS data were imported to Geneious program for further analysis 23 . The reads were trimmed with more than 5% chance of an error per base before being assembled to reference cpDNA of Lilium lancifolium (Accession number KY940844) with similarity over 95% between reads and reference sequence. Consequently, there are 3,852,736 reads of which 17,473 reads (0.45%) were assembled to reference with coverage of 34.5x. The newly completed cpDNA of L. lancifolium was annotated and manual adjusted in Geneious program. The map of cpDNA was illustrated by OGDraw 24 with manual modification. The new cpDNA in this study was aligned with previously reported cpDNA of L. lancifolium (Accession number KY748297 and KY940844) using MAUVE alignment embedded in Geneious to identify hot-spot regions 25 . Also, the newly assembled cpDNA (GenBank Accession number MH177880) was used for identifying RNA editing sites. The RNA sequence data were imported to Geneious and aligned to cpDNA of L. lancifolium using Bowtie 2.0 with mismatch ≤2 26 . The filtered reads (26,824,116 out of 53,643,506 reads) were then analyzed using Cufflinks to calculate the read per kilobase million (RFKM) and TopHat for variants calling 27 . The determination of RNA editing sites was based on the division of reads with editing based on the total reads of that position. If the frequency of C-to-U or U-to-C conversion was over 5%, that position was recognized as an RNA editing sites as described in previous study 28 . An online tool (http://rna.tbi.univie.ac.at/) was used to predict the second structure of rRNA 29 . Transcriptome data of L. lancifolium was deposited to NCBI (SRA accession SAMN08940087).