Introduction

In the genomic era, besides the nuclear, mitochondrial, and plastid genomes, transcriptome data also provide useful information for exploring the evolutionary history of angiosperms. Previously, different transcriptome studies of model plants by applying next-generation sequencing method (NGS) were reported, including Arabidopsis thaliana, rice, sugarcane, and so on1,2,3,4,5. These data added a deeper understanding of the gene expression and biological mechanisms that allow plants to survive and adapt to the environment. For example, transcriptome data revealed the mechanism that showed how sugarcane responded after being infected by bacteria or a virus3,5. Also, transcriptome data suggested genes that are responsible for drought and salinity tolerance in rice4. The transcriptome analysis was not only reported for model plants but also wild species. For instance, RNA editing sites were observed in Spirodela polyrhiza and Phalaenopsis aphrodite subsp formosana6,7. Also, the mechanism of cold responses in Lilium lancifolium was profiled from transcriptome data8. In plants, RNA editing resulted from the modification of nucleotide sequences. This process has a significant effect on gene expression because it can cause the presence of an internal stop codon9,10,11. In fact, the effect of RNA editing on the metabolism of plants was reported12. These studies suggested the necessity of transcriptome data in studying the evolution in angiosperms.

Liliales is a member of petaloid monocotyledons and consists of 10 families of 1558 taxa13. The plastid genomes of this order have been intensively studied, and different genomic events (i.e., inversion, deletion, duplication, and pseudogenization) were discovered14,15,16,17,18,19. For example, various stages of rps16 deletion were recorded in the tribe Melanthieae (Melanthiaceae, Liliales)19. Also, duplication events were found in the Paris  species16,18. Although genomic events were reported, their effects on the post-transcriptional process are unclear because of the lack of plastid transcriptome data in Liliales. Therefore, in this study, we conducted the first plastid transcriptome analysis in Liliales. First of all, Lilium lancifolium was selected as the target species. Then, complete plastid genome (cpDNA) of L. lancifolium was sequenced using Next-generation sequencing method. Based on the new RNA editing data, we test the hypothesis of whether the pseudogenization can be reversed in L. lancifolium. We also check the effect of RNA editing site on the secondary structure of rRNA.

Results

Plastid genome of Lilium lancifolium

The complete plastid genome sequence of L. lancifolium (accession number MH177880; Fig. 1A) in this study is 152,479 bp in length and composed of a large single copy (LSC; 81,888 bp), a small single copy (SSC; 17,607 bp), and two inverted repeat regions (IR; 26,492 bp). In comparison with the previously completed cpDNA of L. lancifolium from China (accession number KY748297) and Korea (accession number KY940844), of which the length of cpDNA is identical (152,574 bp), the gene contents and orders are similar among the three individuals. However, the percentage of the identity of new cpDNA in L. lancifolium in this study is 99.8% and 99.9% compared to counterparts from Korea and China, respectively. Also, the translation initiation factor IF-1 (infA) gene was not annotated in previous data, but it was predicted as a pseudogene in this study because of the presence of internal stop codons within the coding region. Additionally, the different length of poly A sequence after start codon caused two types of cemA gene. The first type is functional cemA in L. lancifolium from China, of which 9-bp-poly A sequence was found. In contrast, the malfunctioning cemA was annotated in cpDNA from Korea counterparts, of which 10-bp and 11-bp-poly A sequences were found and caused internal stop codons in the coding region. The IGS regions among three cpDNA sequences of L. lancifolium showed a high similarity (over 95%), except the ISG regions of petA-psbJ, ndhF-rpl32, and ccsA-ndhD with similarity of 84.2%, 93.95%, and 94.83%, respectively (Table 1).

Figure 1
figure 1

The map of plastid genome and number of RNA editing sites in different gene groups. (A) The map of plastid genome of Lilium lancifolium. Genes shown outside and inside of the outer circle are transcribed counter clockwise and clockwise, respectively. The dark gray area in the inner circle indicates the CG content of the chloroplast genome. The colors represent different groups of genes in cpDNA. LSC: Large single copy; SSC: small single copy; IRA: inverted repeat region A; IRB: inverted repeat region B. (B) The number of RNA editing sites in different gene groups. A: Rubisco; B: ATP dehydrogenase subunit P; C: Ribosomal RNAs; D: Cytochrome b6/f; E: Hypothetical proteins; F: ATP synthase; G: Miscellaneous proteins; H: Large and small subunit ribosomal proteins; I: Photosystem I and II; J: RNA polymerase; K: NADH oxidoreductase.

Table 1 Pairwise identity of IGS region among three complete cpDNAs of Lilium lancifolium.

RNA editing sites and their potential effects

The mapping results of transcription data to complete cpDNA of L. lancifolium revealed 90 editing sites, which located unequally among genes groups (Fig. 1B). The ndh genes possess the highest number of editing sites (29 sites) followed by the RNA polymerase genes (16 sites). The Rubisco gene (rbcL) has only one RNA editing site within its coding region. In contrast, ndhB gene has the highest number of editing sites (10 sites). The most abundant content of RNA editing in L. lancifolium is C-to-U conversion (Table 2). However, the U-to-C conversion was also found in rpl36 and rrn23. Most RNA editing resulted in nonsynonymous substitution, of which the changes from S (Serine) to L (Leucine) is the most frequent (25 sites), followed by S (serine) to F (Phenylalanyl) with 17 sites. Nevertheless, 12 out of 90 editing sites resulted in synonymous substitution (Table 2). Additionally, a total of 32 editing sites were also found in IGS regions of L. lancifolium cpDNA (Supplementary Table 1). In comparison to the previous complete cpDNA of L. lancifolium, the infA and cemA were annotated as pseudogenes because of the presence of internal stop codons within the coding regions. However, the transcriptome data revealed that there are no significant RNA editing sites within the coding region of these two genes. The RNA editing occurred not only in protein-coding genes but also in rRNA (Table 2). Specifically, the U-to-C conversion was found in rrn23S, whereas the C-to-U conversion was recorded in rrn5S. A further check on the predicted structure of rrn5S showed that the editing event affected the structure of rrn5S (Fig. 2). The RNA expression level was also compared among protein-coding genes of L. lancifolium cpDNA (Table 3). The results showed that the psbA is the most expressed gene followed by rbcL and petB. Although the ndh genes have the highest number of RNA editing sites, their expression level is lower than other genes (Table 3). The petL has the lowest expression level.

Table 2 The number of RNA editing sites among coding regions of plastid genome of L. lancifolium.
Figure 2
figure 2

The predicted secondary structure of rrn5S with (lower) and without RNA editing site (upper). The color radian (from purple to red) means the probability of connection among nucleotides (from 0 to 1). The black arrows indicate the position of RNA editing.

Table 3 RNA expression of protein-coding genes in the L. lancifolium chloroplast genome.

Discussion

Previously, most of the plastid genome studies used the data of one individual as the representative of that species and focused on a comparative genomic analysis with closely related taxa15,16,18. This approach has not been fully providing detailed information on the diversification of cpDNAs within a species. Recently, Shi et al.20 reported 11 complete cpDNAs of both cultivated and wild watermelon. The cpDNAs of three individuals of each species were completed and compared to others. They showed that although the gene number and order are identical among examined species, the wild watermelon exhibited a significant change in the plastid genome size. In this study, the newly sequenced cpDNA revealed both conserved and diverse trends in comparison with previously published cpDNA of L. lancifolium. In fact, the cpDNA of L. lancifolium in this study is longer (105 bp) than those in previous studies which have an identical length (152,574 bp). Additionally, the Korean L. lancifolium has nonfunctional cemA which was found as a functional gene in Chinese counterpart. These findings suggested the interspecific diversification of plastid genome among wild plants and the need for further studies on this issue. Additionally, low similarities in IGS of petA-psbJ, ndhF-rpl32, and ccsA-ndhD suggest these regions as hot-spot sites for further studies on evolution of L. lancifolium and related species (Table 1).

RNA editing plays an important role during the post-transcriptional process because it alters the coding content of the genes by two pathways of insertion/deletion and conversion/substitution. Specifically, the C-to-U conversion altered a serin to phenylalanine codon in psbF mRNA of Spirodela polyrhiza6. Also, the formation of translation initiation, or internal stop codon, caused by RNA editing, was reported9,10,11. A similar trend was found in the transcriptome data of L. lancifolium (Table 2). The changes of amino acid composition among genes were mainly caused by C-to-U conversion. Also, the formation of the start codon in rpl36 and ndhD genes resulted from C-to-U conversion at the second position in the start codon. In L. lancifolium cpDNA, infA and cemA genes were annotated as pseudogenes and expected to be corrected by RNA editing process. However, there are no RNA editing sites in mRNA of these two genes. Previously, the loss of infA in cpDNA was recorded in many plants21 and compensated by the nuclear infA. In fact, mRNA of nuclear infA was found by assembling transcriptome data to infA gene of Pheonix dactylifera (GenBank Accession XM_008784933). In contrast, the case of cemA needs further investigations. Among the four examined monocots, there are different numbers of RNA editing sites (Supplementary Table 2). Most of RNA editing sites resulted in nonsynonymous substitutions, except for Phalaenopsis aphrodite subp formosana of which more than half substitution is synonymous (Supplementary Table 2). However, most of the editing content is C to U in all examined taxa. Notably, the RNA editing in Deschampsia antarctica revealed the changes from G to A, G to C, A to G, A to C, A to U, and U to G22.

The transcriptome data revealed different expression level of genes in L. lancifolium cpDNA (Table 3). In comparison with RNA expression in D. antarctica, a member of the grass family, there are differences among expression level between L. lancifolium and D. antarctica22. One possible explanation could be the different habitat environment. In fact, D. antarctica adapted to the harsh environment of Antarctica whereas L. lancifolium distributes in temperate regions. Additionally, the gene expression is different during development stages of plants. In this study, we used only the leaf tissue at the growth stage of L. lancifolium. Therefore, further studies of transcriptome of various tissues at different development stages should be conducted to explore the overall trend of expression in L. lancifolium.

Previously, Chen et al.7 reported the effect of RNA editing on stabilizing the secondary structure of trnM in Phalaenopsis aphrodite subsp formosana. In this study, RNA editing site resulted in changes in the secondary structure of rrn5S. These data revealed the effect of RNA editing process on the stability of rRNA and tRNA. Although RNA editing sites have been recorded in protein coding regions of plastid genomes, studies on the secondary structure of these changes have not been fully conducted. Therefore, further investigations should be done to give a deeper understanding of this issue.

To sum up, the first plastid genome analysis of L. lancifoliun provided fundamental information for further studies on post-transcription events in Liliales. In fact, the RNA editing process is not able to reverse the pseudogenization caused by genomic events in plastid genomes of L. lancifolium. In this study, only leaf tissue was used for plastid transcriptome study, which does not fully reflect the evolutionary information in L. lancifolium. Therefore, transcriptome data of other tissue should be generated to trace the evolutionary history in L. lancifolium and other species of Liliales.

Materials and Methods

Plant materials, total DNA extraction, and RNA isolation

The mature fresh leaves of Lilium lancifolium during the growth stage were collected in its wild habitats. The specimen of L. lancifolium was made and deposited to Gachon University Herbarium. For DNA extraction, the fresh leaves were dried in silica gel before extraction steps with Plant DNeasy Mini Kit (Qiagen, Korea). To isolate total RNA, the fresh leaves were immediately put in liquid nitrogen after collected. Then, they were stored in the cold condition until being used for RNA isolation, which was conducted using Plant RNeasy mini Kit (Qiagen, Korea). Both DNA extraction and RNA isolation were conducted based on manufacturer’s instruction. The quality of DNA and RNA were tested using gel electrophoresis and one spectrophotometer.

NGS generation, genome assembly, RNA editing determination and prediction of rRNA structure

To generate NGS data, the total DNA and RNA from leaves of L. lancifolium were applied to Illumina Hiseq 200 and Nextseq 500, respectively. First of all, the DNA and RNA were fragmented. Then, newly fragmented DNA and RNA were hybridized and ligated with adapters. In the next step, PCR amplification was employed to create the sequence library. Finally, the library was sequenced and resulted in the DNA-NGS data of 301 bp in length and transcriptome data of 76 bp in length. The DNA-NGS data were imported to Geneious program for further analysis23. The reads were trimmed with more than 5% chance of an error per base before being assembled to reference cpDNA of Lilium lancifolium (Accession number KY940844) with similarity over 95% between reads and reference sequence. Consequently, there are 3,852,736 reads of which 17,473 reads (0.45%) were assembled to reference with coverage of 34.5x. The newly completed cpDNA of L. lancifolium was annotated and manual adjusted in Geneious program. The map of cpDNA was illustrated by OGDraw24 with manual modification. The new cpDNA in this study was aligned with previously reported cpDNA of L. lancifolium (Accession number KY748297 and KY940844) using MAUVE alignment embedded in Geneious to identify hot-spot regions25. Also, the newly assembled cpDNA (GenBank Accession number MH177880) was used for identifying RNA editing sites. The RNA sequence data were imported to Geneious and aligned to cpDNA of L. lancifolium using Bowtie 2.0 with mismatch ≤226. The filtered reads (26,824,116 out of 53,643,506 reads) were then analyzed using Cufflinks to calculate the read per kilobase million (RFKM) and TopHat for variants calling27. The determination of RNA editing sites was based on the division of reads with editing based on the total reads of that position. If the frequency of C-to-U or U-to-C conversion was over 5%, that position was recognized as an RNA editing sites as described in previous study28. An online tool (http://rna.tbi.univie.ac.at/) was used to predict the second structure of rRNA29. Transcriptome data of L. lancifolium was deposited to NCBI (SRA accession SAMN08940087).