Introduction

The chloroplast (cp) is a vital, plant-specific organelle that serves as the site of photosynthesis, converting light energy into chemical energy. It is also involved in other biochemical processes, including the synthesis of sugars, amino acids, lipids, vitamins, starch and pigments, as well as sulfate reduction and nitrogen metabolism1,2,3. The cp genome is typically circular and 120 to 160 kbp in size. Most cp genomes share a similar structure: a large single-copy region (LSC) and a small single-copy region (SSC) separated by two inverted repeats (IRs). The gene order and content of cp genomes are generally highly conserved, and the substitution rate in cp DNA is lower than that in nuclear DNA4,5. Given their highly conserved sequences, similar structures and stable maternal inheritance, cp genomes are a valuable resource for plant phylogenetics, population genetics, species identification and genetic engineering6. Gains and losses of functional genes in the cp genome cause expansions or contractions, respectively, which explain genome size variation and also reflect species differentiation events1.

Since the complete cp genome sequences of tobacco and liverwort were published7,8, an increasing number of seed plant cp genomes have been sequenced and reported8. With the development of next-generation sequencing (NGS), sequencing has become cheaper and faster. NGS provides high-yield, accurate data on complete cp genomes, and the number of sequenced cp genomes from various plants is increasing quickly. However, most studies of complete cp genomes to date have focused on seed plants. Although ferns are a major group of plants, complete cp genomes have been reported for only 60 ferns9,10. Other studies of ferns have been based on partial sequences or fragments of cp genomes11.

Dryopteris fragrans (L.) Schott is a perennial fern that grows on the surfaces of rocks and lava. It belongs to the family Dryopteridaceae and occurs in small communities in the Far East, Europe and North America12. Wudalianchi (N48°30′–48°51′, E126°00′–126°25′, altitude 295–315 m) in Heilongjiang Province marks the centre of its distribution in China13. Its biotope differs markedly from that of other ferns. Most ferns prefer shady, warm and moist places; if placed in dry, hot environments or exposed to ultraviolet (UV) radiation for even a short time, they quickly crinkle and wilt or die. D. fragrans, however, grows directly exposed to sunlight on the dry surfaces of black rocks and lava, where it must endure strong UV radiation and surface temperatures of approximately 70 °C in summer. It is therefore an unusual fern with superior stress resistance, and it must possess special mechanisms that allow it to live in this severe environment. Previous studies of this species have mostly focused on its secondary metabolites and related genes14,15,16,17,18,19,20,21,22,23,24,25,26. Although much research has been performed on this species, its ability to survive in severe environments has largely been ignored. In our previous study, we partly sequenced its genome and obtained some contigs of its cp genome, which attracted our attention. Because the cp is the energy transducer of the plant, it can sense changes in the external environment directly and react quickly. Studying the complete cp genome of D. fragrans may therefore provide useful information about its superior heat resistance.

Here, we sequenced and annotated the complete cp genome of D. fragrans and compared its features and structure with those of related species. RNA editing sites in protein-coding genes were predicted with PREP and validated using transcripts. Simple sequence repeats (SSRs) and long repeat structures were identified in this cp genome. The long repeat structures were extremely abundant compared with other species, and most were located in intergenic spacer (IGS) regions; these repeats had a high GC content, which may enhance cp genome stability. A thermal denaturation experiment showed that the D. fragrans cp genome exhibited strong thermal stability. These data provide useful information and contribute to a better understanding of how this unusual fern lives in harsh environments. They will also be helpful for future studies of secondary metabolism, genetic engineering, physiology and evolution in ferns and other species.

Results

Chloroplast genome assembly and validation

Quantitative real-time polymerase chain reaction (qRT-PCR) showed that rbcL was detected in the isolated cpDNA samples, whereas actin 6, a nuclear-specific gene, was not (Supplemental Fig. 1), indicating that the cpDNA samples were pure. The sequencing run generated 2,740,440 raw reads, totaling 822,132,000 bases, with an average read length of 300 bp from the D. fragrans cp genome. A total of 2,262,910 clean reads with an average read length of 184.56 bp were de novo assembled into 31 contigs. The average sequence coverage depth of each nucleotide of the cp genome was 105×. A maximum scaffold of 143,707 bp was generated that spanned most of the small and large single-copy regions (SSC and LSC) and the entire inverted repeat (IR) region. Because the IR region had double the coverage of the remaining scaffold, it was used twice in the complete cpDNA sequence. We submitted the annotated cp genome sequence of D. fragrans to GenBank under accession number KX418656.2.

Chloroplast genome features and comparison

The cp genome of D. fragrans was 151,978 bp in length, with a typical quadripartite structure (Fig. 1). It included a pair of inverted repeats (IRA and IRB) of 17,314 bp separated by one SSC and one LSC of 31,947 bp and 85,332 bp, respectively. The D. fragrans cp genome contained 112 genes (Table 1), including 4 ribosomal RNA genes, 26 transfer RNA genes and 22 genes encoding ribosomal subunits, of which 12 encoded the small subunit and 10 the large subunit. It also included 3 genes encoding DNA-directed RNA polymerases and 44 genes dedicated to photosynthesis, of which 11 encoded subunits of the NAD(P)H-quinone oxidoreductase, 4 encoded the photosystem I complex, 13 encoded the photosystem II complex, 6 encoded the cytochrome b/f complex, 6 encoded different subunits of the ATP synthase and 1 encoded the large chain of ribulose bisphosphate carboxylase (rbcL). Three genes encoded the dark-operative protochlorophyllide oxidoreductase subunits; 5 genes (ycf1, ycf2, ycf3, ycf4 and ycf12) were hypothetical chloroplast open reading frames; 2 encoded proteases; 1 encoded a translational initiation factor; and 5 encoded other proteins. Among them, 14 genes contained introns: psaA, atpF, ndhA, ndhB, ndhE, ndhF, ndhG, rpl2, rpl20, rpoB, rpoC1, cemA, clpP and ycf3 (Table 1). Compared with Adiantum capillus-veneris, Pteridium aquilinum subsp. aquilinum and Cyrtomium devexiscapulae, the D. fragrans cp genome gained orf42, trnR-ACG, rrn5, rrn4.5, rrn23, trnA-UGC and trnN-GUU in the SSC. In addition, trnR-ACG, rrn5, rrn4.5, rps12 and trnI-GUG were lost in the IRs, and ndhB was truncated in IRA. The psbK and trnG-UCC genes were lost in IRB-LSC (Fig. 2). The GC content of the cp genome was 43.15%. It was highest in the IR regions (44.18%), followed by the SSC (43.26%) and the LSC (42.70%). The GC contents of rRNA genes (55.75%) and tRNA genes (55.46%) were higher than that of protein-coding regions (44.06%). The comparison of cp genome size, GC content, gene number and order is listed in Table 2. The D. fragrans cp genome did not contain a tRNA for the Lys codons. Codon usage frequencies are listed in Table 3. Among the codons, those encoding glutamate (987, 3.57%) were the most frequent and those encoding cysteine (166, 0.60%) were the least frequent.
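The regional GC percentages reported above can be reproduced from any annotated sequence with a few lines of code. The following is a minimal Python sketch, not the pipeline used in this study; the function names and the 0-based, end-exclusive coordinate convention are assumptions made for illustration, and the region boundaries would have to be taken from the annotation.

```python
def gc_content(seq):
    """Return the GC percentage of a DNA sequence, ignoring non-ACGT symbols."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * gc / (gc + at)


def regional_gc(genome, regions):
    """GC% per named region.

    `regions` maps a label (e.g. "LSC") to (start, end) coordinates,
    assumed here to be 0-based and end-exclusive.
    """
    return {name: gc_content(genome[start:end])
            for name, (start, end) in regions.items()}
```

For example, `regional_gc(genome, {"LSC": (0, 85332)})` would report the GC percentage of the first 85,332 bases of a sequence held in `genome`, provided the LSC starts at the first base of the record.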

Figure 1

The mapped D. fragrans (L.) Schott circular chloroplast genome. Genes presented outside of the outer circle are transcribed in a clockwise manner, and those inside are transcribed in a counter-clockwise manner. Functional categories of genes are colour-coded. The dashed area in the inner circle indicates the GC content of the chloroplast genomes.

Table 1 Genes present in the D. fragrans chloroplast genome.
Figure 2

Comparison of gene order and content in the LSC, IR, and SSC regions among four cp genomes. Compared with other species, the IR length of D. fragrans is shorter and its SSC is the longest. The D. fragrans cp genome lost ndhF and gained orf42, trnR-ACG, rrn5, rrn4.5, rrn23, trnA-UGC and trnN-GUU in SSC. The trnR-ACG, rrn5, rrn4.5, rps12 and trnI-GUG were lost in the IR regions, while ndhB was truncated in the IRA. The psbK and trnG-UCC were lost in IRB-LSC.

Table 2 The characters of chloroplast genomes selected from 21 Pteridophyta, 1 Gymnospermae, 1 Monocot and 1 Dicot.
Table 3 Codon usage and codon-anticodon recognition pattern for tRNA in D. fragrans cp genome.

SSRs and repeat structures in the D. fragrans chloroplast genome

MISA detected 44 SSRs in the D. fragrans cp genome (Supplemental Table 2), including 41 homopolymers and 3 dipolymers. Tetrapolymers, pentapolymers and hexapolymers were not found. Sixteen SSRs were composed exclusively of A or T bases, 27 of G or C bases, and 1 of AG. Thus, apart from the AG dipolymer, most SSRs consisted of G or C bases. All of these SSRs were located in the IGS, and none were located in protein-coding genes. Repeat analysis with REPuter, using the criteria of a repeat size ≥30 bp and a sequence identity ≥90%, identified 80 forward, 1 reverse and 23 palindromic repeat structure pairs of 30 to 55 bp. Repeat lengths of 30 to 32 bp were most common (27.40%). A total of 53 repeat pairs were found in the coding regions, of which 6 were located in transfer RNA genes. The remaining 151 repeat pairs were located in the IGSs (Supplemental Table 3). In addition, one of the longest repeat structures (55 bp) overlapped with the longest SSR (an 18 bp G mononucleotide sequence) (Supplemental Table 3). The average GC content of the repeat structures was 43.04%, with a maximum of 63.64% and a minimum of 30%. To compare the number of repeat structures with that of other fern species, we extracted the corresponding sequences from 29 ferns and determined the number of repeat structures of different lengths. Repeat structures were abundantly distributed in the D. fragrans cp genome, and this species contained the most repeat structures among the ferns examined (Fig. 3A). Compared with the other 29 genomes, the D. fragrans cp genome had the highest percentage of repeat structures (5.351%), exceeding the other species by a factor of 1.50 to 21.28 (Fig. 3B). Thus, the D. fragrans cp genome was rich in repeat structures.

Figure 3

Comparison of repeat structure number and percentage among 30 ferns. The sizes of the repeats are set at a repeat minimal length of ≥33 bp and maximal length of ≤55 bp with a Hamming distance of 3. The number and percentage of the repeat structures from 29 ferns were compared with those of D. fragrans. (A) The number of repeat structures in the D. fragrans cp genome compared with that of 29 ferns. D. fragrans possesses the most repeat structures; (B) The percentage of repeat structures in the D. fragrans cp genome compared with that of 29 ferns. D. fragrans possesses the highest repeat structure percentage (5.351%).

RNA editing

Transcripts obtained by PCR were used to verify the RNA editing sites predicted by PREP. PREP predicted 438 RNA editing sites in protein-coding genes, corresponding to 338 codon changes. All predicted editing events were C-to-U. Among them, 96 non-synonymous changes were found at the first position of the codon, while the remainder were found at the second position, and none were found at the third position. In the transcript validation, however, 345 RNA editing sites were found in the D. fragrans cp genome (Supplemental Table 4): 88 at the first position of the codon, 208 at the second position and 49 at the third position. C-U changes were the most common, with 305 (88.41%) sites, followed by U-C 13 (3.77%), G-A 8 (2.32%), A-G 7 (2.03%), A-C 3 (0.87%), C-G 2 (0.58%), G-U 2 (0.58%), U-A 2 (0.58%), U-G 2 (0.58%) and G-C 1 (0.29%) (Supplemental Fig. 2). There were 318 codon changes, including 33 synonymous and 285 nonsynonymous changes. The largest number of editing sites was predicted in the ndhF gene (130629-123964, 41 editing sites), followed by the atpB gene (72943-71462, 21 editing sites). The amino acid conversions included 119 hydrophilic-to-hydrophobic changes (H to Y, S to L, S to F and T to M) and 105 hydrophobic-to-hydrophobic changes (L to F, A to V, P to S, R to W and P to L). Codons converted to leucine (Leu) were the most common, accounting for 125 changes (39.31%). The most frequent codon change was TCA (S) to TTA (L) (37 editing sites), followed by TCG (S) to TTG (L) (28 editing sites) and CCA (P) to CTA (L) (21 editing sites).

Comparison of cpDNA thermal stability

To test whether repeat structures with high GC content enhanced the stability of the D. fragrans cp genome, thermal denaturation was performed on the cp genomes of all species. In the denaturation experiment, the absorbance of all samples increased with temperature (Fig. 4). At 35 °C, the percentage increase for all ferns was at or below approximately 20%. Some samples rose quickly, such as Arabidopsis, wheat and T. palustris. Most samples began to rise quickly at 35 °C and rose sharply at 55 °C. Almost no cp genome withstood 75 °C, at which absorbance increased greatly. Only D. fragrans remained stable from the start up to 55 °C and changed only slightly from 65 °C to 75 °C. Even at 85 °C, D. fragrans still showed the lowest value of all samples. This showed that its cp genome could withstand heating.

Figure 4

Percentage absorbance increase of cpDNA during thermal denaturation. The absorbance increases of 8 plants, including 6 ferns, 1 dicotyledon and 1 monocotyledon, were compared with that of D. fragrans (red). D. fragrans shows considerable stability against heat.

Discussion

DOGMA is the most popular software for cp genome annotation and is widely used27,28,29,30. It can detect protein-coding genes, rRNAs and tRNAs quickly, but it is not very sensitive in detecting introns. In our work, DOGMA annotated 110 genes but did not detect genes with intron(s). We therefore performed a second analysis using MAKER-P, which detected 9 genes (ndhF, rpl21, rps6, cemA, ccsA, lhbA, matK, ycf1 and ycf12) that were not annotated by DOGMA. It also identified 14 genes containing intron(s) and corrected their positions (Table 1). The genes in fern cp genomes differ from those of seed plants, although both groups are vascular plants. As a result, DOGMA performed poorly on ferns and produced incorrect results, whereas MAKER-P showed an advantage in detecting protein-coding genes and introns. Thus, annotating fern cp genomes requires the use of more than one software program.

The typical size of a fern cp genome is 131 to 168 kb31,32, and the D. fragrans cp genome is within this range. Its gene number and order are largely similar to those of other fern cp genomes, but there are some differences among species. Genome size variation is mostly due to length variation in the IR and SSC regions28. Some IR expansions/contractions are observed within species. Compared with other ferns, the IR regions of the D. fragrans cp genome lost a 4033 bp sequence, including trnR-ACG, rrn5, rrn4.5, rrn23, trnA-UGC and trnN-GUU. This sequence is located in the IRs of other fern cp genomes but has moved into the SSC in the D. fragrans cp genome. Thus, these genes do not exist as two copies in the D. fragrans cp genome, and it is possible that the fern has reduced their expression. As a consequence, the D. fragrans cp genome contains the longest SSC and shorter IRs (Table 2). Although synteny and inversions are important, the structure of the D. fragrans cp genome does not show obvious changes, which is consistent with the results of Xiang et al.9. Furthermore, overlapping genes are not prominent (Fig. 1), which would reduce cp genome usage. Thus, the D. fragrans cp genome has more intergenic sequence, leading to a more dispersed gene distribution and a longer sequence. These findings suggest that the fern cp genome reduces gene overlap but extends its intergenic sequences, so that genes are more independent and sequence utilization is more specific.

RNA editing is an important post-transcriptional process in cps and is thought to be functionally significant33. In our work, editing was evident in protein-coding genes. There were large disparities between the PREP predictions and the transcript validation: the numbers of RNA editing sites and codon changes predicted by PREP were considerably higher than those validated. This indicates that PREP may have some deficiencies, although its predictions broadly matched the overall number of validated changes. This may be because the PREP database is not comprehensive enough, especially for ferns. Moreover, the result shows that transcript verification is necessary when predicting RNA editing sites. In seed plant cps, C-to-U conversion is the predominant form34, whereas reverse U-to-C editing is rare or absent in seed plants35. In the D. fragrans cp genome, C-to-U was likewise the most frequent type of change, accounting for 88.4% of editing events. It has been reported that the excess of C-to-U RNA editing developed in the early stages of vascular plant evolution36. Our results support this view. In addition, the number and percentage of codons converted to Leu were the highest in the D. fragrans cp genome. This is similar to Adiantum capillus-veneris, although the genetic distance between A. capillus-veneris and D. fragrans is large within the fern clade. Leu biosynthesis occurs in the cp and plays an important role in photosynthesis-related metabolism37,38. Both species show heavy conversion to Leu codons, suggesting that they have a great need for Leu. Their level of RNA editing is more than ten times that of any other vascular plant examined across an entire cp genome39. This indicates that RNA editing occurs in different fern species and may play a major role in fern cp and cp genome processing.

Simple sequence repeats (SSRs), also known as microsatellites or short tandem repeats (STRs), have repeat units of 1–6 or more base pairs. They are important genetic molecular markers for population genetics40,41 and are widely used for plant genotyping42,43. In this work, 44 SSRs were found in the D. fragrans cp genome, and GC SSRs outnumbered AT SSRs. This finding contrasts with the view that cp SSRs are generally composed of short polyadenine (poly A) or polythymine (poly T) repeats and rarely contain tandem guanine (G) or cytosine (C) repeats44. In addition, the number and percentage of repeat structures (30–55 bp) in the D. fragrans cp genome were far higher than in other species (Fig. 3). This is the first time that a fern species has been shown to contain such a considerable number of repeat structures. Most repeat structures were dispersed among the IGSs, and the GC percentages of most repeat structures were higher than the average value (43.04%). This suggests that the dispersed repeat structures probably play a key role in maintaining cp genome stability. D. fragrans may have increased the number and length of its IGSs through the insertion of repeat structures with a high GC content. Previous work has suggested that repeat structures are very important for sequence rearrangement and variation in cp genomes by preventing illegitimate recombination and slipped-strand mispairing1,45,46. Our results support this point of view. Furthermore, these repeat structures have also become part of the intergenic sequences between genes, resulting in the independence of each gene. This feature may allow for the selective expression of genes.

Wudalianchi was formed by great volcanic eruptions. Its landscape consists mainly of alkaline basalt47, a black volcanic rock with a low specific heat capacity (0.84 kJ/(kg·K)) and low thermal conductivity. Under long, direct sunshine in summer, the basalt absorbs a large amount of heat quickly, which readily produces high surface temperatures and a locally hot environment across the basalt terrain. The temperature of the basalt surface in summer often reaches 70 °C. Conversely, the basalt cools quickly at night. D. fragrans grows on the exposed basalt surface and is subjected to large temperature fluctuations between day and night. Most ferns cannot endure such high temperatures and dramatic temperature changes, but D. fragrans is a rare fern that is highly resistant to heat. As described above, our study revealed that this fern possesses the greatest number of repeat structures, with a high GC percentage, among all ferns studied. The three hydrogen bonds of a GC pair are stronger than the two of an AT pair, such that the GC percentage affects the strength of the DNA double helix (i.e., loose or tight). Some researchers have noted that the higher the GC content, the more stable the structure of the genomic DNA48. We speculated that these repeat structures with high GC content may allow the fern to cope with heat and large temperature differences. We therefore performed a heat denaturation experiment to compare the cpDNA thermal stability of fern species and other species from different habitats and families, including Nephrolepidaceae, Thelypteridaceae, Pteridaceae, Davalliaceae, Aspleniaceae, Polypodiaceae, Dryopteridaceae, Dennstaedtiaceae, Parkeriaceae and Isoetaceae. Arabidopsis and wheat showed the most obvious changes in the denaturation experiment, indicating that their cpDNA is very sensitive to heat. S. sessilifolia, T. palustris, C. fortunei, I. sinensis and C. thalictroides changed earlier and to a large extent below 45 °C. The habitats of these five species are swamps or humid forest understories, and their heat resistance was weak. D. chingiae, N. biserrata, P. scolopendrium, M. strigosa and P. amoena showed smaller variations and similar heat stability between 35 and 45 °C. However, most of them were not stable above 45 °C, and their absorbance began to rise significantly with increasing temperature (Fig. 4). D. chingiae, P. amoena and M. strigosa are saxicolous forest ferns. Their cpDNA showed some heat resistance, but their thermal stability was limited. In addition, there were great differences in thermal stability between D. fragrans and C. fortunei, although both belong to the family Dryopteridaceae. This suggests that great differences exist among species of the same family, which may be caused by their different environments. These results support the speculation that a considerable number of dispersed repeat structures with a high GC content (43.04%) enhance the thermal stability of D. fragrans cpDNA and maintain its structure in the face of thermal changes. This is one of the molecular bases of the response of D. fragrans to severe environments. It also provides a new perspective for understanding the mechanisms of environmental adaptation in plants.

Methods

Plants, cp DNA extraction, sequencing and assembly

Fresh leaves of D. fragrans from Wudalianchi, Heilongjiang Province were collected, cleaned and frozen in liquid nitrogen. Professor Baodong Liu of Harbin Normal University provided the leaves of Nephrolepis biserrata, Polypodiodes amoena, Isoetes sinensis, Cyrtomium fortunei, Phyllitis scolopendrium, Davallodes chingiae, Scutellaria sessilifolia, Microlepia strigosa, and Thelypteris palustris. Arabidopsis thaliana, wheat (Triticum aestivum L.) and Ceratopteris thalictroides were collected in our lab. The cp isolation method was modified from previous methods49,50,51. Five grams of intact leaves from each species were picked and rinsed. The leaves were then crushed in liquid nitrogen and ground in a separation solution (0.33 M D-sorbitol, 50 mM Tris-HCl pH 7.6, 5 mM MgCl2, 10 mM NaCl, 2 mM EDTA, 2 mM D-sodium erythorbate and 0.2% β-mercaptoethanol). The cell suspensions were filtered, and the filtrate was centrifuged at 1000 rpm for 10 min to remove large cell fragments. The supernatant was collected and centrifuged at 4000 rpm for 10 min to pellet the cps. Pure cpDNA was extracted from the isolated cps using the CTAB-based method52. Every DNA sample was treated with RNase. To assess contamination by the nuclear genome, a nuclear-specific gene, actin6, and a cp-specific gene, rbcL, were selected for qRT-PCR. Specific or degenerate primer pairs were designed based on specific or homologous sequences (Supplemental Table 1). A LineGene 9620 instrument (HANGZHOU BIOER TECHNOLOGY Co., LTD. China) and TransStart Green qPCR SuperMix (TRANSGEN BIOTECH, China) were used for detection. The qRT-PCR program was as follows: 5 min at 95 °C, followed by 40 cycles of 15 s at 95 °C and 30 s at 60 °C. The D. fragrans cpDNA samples were sequenced using Illumina technology on a HiSeq 2000 at Genewiz (China).
To improve the quality and accuracy of the Illumina reads, we used Trimmomatic (version 0.30)53 to filter adaptor sequences. SSPACE (version 3.0)54, GapFiller (version 1.10)55 and Velvet (version 1.2.10)56 were used with default parameters to examine the reads and assemble them into contigs and scaffolds. Some gaps remained after assembly and were filled by PCR to complete the genome. PCR reactions were performed in a total volume of 20 μL containing 6 μL of deionized sterile water, 10 μL of EasyTaq Mix buffer, 1 μL of each primer at 10 pmol/μL (TransGen Biotech, Beijing, China) and 2 μL of cp DNA. PCR products were purified and sequenced by Bio-Serve (Harbin, China). All primers used for gap filling are listed in Supplemental Table 1.

Chloroplast genome annotation and comparative analyses

Genes in the D. fragrans cp genome, including protein-coding, rRNA and tRNA genes, were located and annotated using the Dual Organellar GenoMe Annotator (DOGMA) (http://dogma.ccbb.utexas.edu)57 and MAKER-P58,59. All genes, rRNAs, and tRNAs were identified using the plastid/bacterial genetic code. The predicted annotations were verified using Chloroplast Genome DB (http://chloroplast.cbio.psu.edu/)60 and Blast61. tRNAscan-SE was used to identify the tRNAs62. Codon usage and relative synonymous codon usage (RSCU) were calculated with CodonW 1.4.2 (http://codonw.sourceforge.net)63. The annotated sequence was submitted to GenBank. The circular gene map of the D. fragrans cp genome was generated using OGDRAW64.
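RSCU is usually defined as the observed count of a codon divided by the count expected if all synonymous codons for the same amino acid were used equally. The following minimal Python sketch implements that definition; it is not CodonW, and the function name `rscu` and the input format (a list of in-frame coding sequences) are assumptions made for illustration. The codon table uses the standard sense-codon assignments, which the plastid/bacterial code shares.

```python
from collections import Counter, defaultdict
from itertools import product

# Codon table built in TCAG order; '*' marks stop codons.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(_BASES, repeat=3), _AMINO)}


def rscu(cds_list):
    """Return {codon: RSCU} for all sense codons of amino acids that occur."""
    counts = Counter()
    for cds in cds_list:
        cds = cds.upper().replace("U", "T")
        # Walk the sequence codon by codon, ignoring a trailing partial codon.
        for i in range(0, len(cds) - len(cds) % 3, 3):
            codon = cds[i:i + 3]
            if CODON_TABLE.get(codon, "*") != "*":
                counts[codon] += 1

    # Group synonymous codons by the amino acid they encode.
    families = defaultdict(list)
    for codon, amino in CODON_TABLE.items():
        if amino != "*":
            families[amino].append(codon)

    result = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid absent: RSCU undefined
        expected = total / len(codons)
        for c in codons:
            result[c] = counts[c] / expected
    return result
```

A value above 1 indicates a codon used more often than expected under equal synonymous usage, and a value below 1 indicates one used less often.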

The GC content, the LSC, SSC and IR regions, the gene number and the total genome length of the D. fragrans cp genome were compared with those of the cp genomes of Adiantum capillus-veneris (NC_004766), Osmundastrum cinnamomeum (NC_024157), Cyrtomium devexiscapulae (KT599100), Woodwardia unigemmata (KT599101), Alsophila spinulosa (NC_012818), Psilotum nudum (NC_003386), Pteridium aquilinum subsp. aquilinum (NC_104348), Angiopteris evecta (NC_008829), Isoetes flaccida (NC_014675), Huperzia lucidula (NC_006861), Athyrium anisopterum (NC_035738.1), Athyrium opacum (NC_035841.1), Austroblechnum melanocaulon (NC_035840.1), Deparia lancea (NC_035844.1), Diplazium dushanense (NC_035851.1), Homalosorus pycnocarpos (NC_035855.1), Macrothelypteris torresiana (NC_035858.1), Matteuccia struthiopteris (NC_035859.1), Onoclea sensibilis (NC_035861.1), Pseudophegopteris aurita (NC_035861.1), Ginkgo biloba (AB684440), Arabidopsis thaliana (NC_000932) and Oryza sativa (NC_001320). Furthermore, we compared the borders, gene content and gene order of the LSC, SSC and IR regions with those of A. capillus-veneris, P. aquilinum subsp. aquilinum and C. devexiscapulae.

Examination of the repeat sequences and RNA editing

MISA, a microsatellite identification tool (http://pgrc.ipk-gatersleben.de/misa/misa.html), was used to detect SSRs65, with thresholds of mononucleotide repeats ≥10 bases, dinucleotide repeats ≥6 bases, tri- and tetranucleotide repeats ≥5 bases, and hexanucleotide or greater repeats ≥5 bases. The maximum distance between two SSRs was 100 base pairs. Based on these analyses, we identified the locations of the SSRs. The REPuter program66 was used to assess long forward, reverse and palindromic repeat sequences within the cp genomes. The repeat sizes were set at a minimal length of ≥30 bp and a maximal length of ≤55 bp, with a Hamming distance of 3. Furthermore, we selected 29 ferns, including Adiantum capillus-veneris, Alsophila spinulosa, Angiopteris evecta, Athyrium opacum, A. sinense, A. sheareri, Austroblechnum melanocaulon, Cyrtomium falcatum, C. devexiscapulae, Deparia pycnosora, Diplaziopsis cavaleriana, Diplazium dilatatum, D. dushanense, D. lancea, D. striatum, Dryopteris decipiens, Homalosorus pycnocarpos, Lygodium japonicum, Macrothelypteris torresiana, Matteuccia struthiopteris, Marsilea crenata, Osmundastrum cinnamomeum, Onoclea sensibilis, Psilotum nudum, Pteridium aquilinum subsp. aquilinum, Schizaea elegans, S. pectinata, Thelypteris aurita and W. unigemmata, and calculated their long repeat sequences using the same parameters. We then compared the numbers of repeats of different lengths among these ferns.
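The SSR thresholds above can be illustrated with a short regular-expression scan. The following is a minimal Python sketch, not MISA itself. It assumes that the thresholds are minimum numbers of repeat units, and the pentanucleotide threshold of 5 is an assumption because the text does not state one. The names `MIN_REPEATS`, `is_primitive` and `find_ssrs` are hypothetical.

```python
import re

# Minimum number of repeat units per motif length.
# The value for motif length 5 is an assumption; the text does not state it.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit (e.g. 'AA')."""
    n = len(motif)
    return not any(n % k == 0 and motif == motif[:k] * (n // k)
                   for k in range(1, n))


def find_ssrs(seq):
    """Return sorted (1-based start, motif, repeat count) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for size, min_rep in MIN_REPEATS.items():
        # A motif of `size` bases followed by at least min_rep - 1 copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if is_primitive(motif):
                hits.append((m.start() + 1, motif, len(m.group(0)) // size))
    return sorted(hits)
```

In this sketch an 18 bp G run is reported once as `("G", 18)` and not again as a `GG` or `GGG` repeat, because non-primitive motifs are skipped. The sketch does not merge neighbouring SSRs into compound repeats, so it does not use the 100 bp maximum distance.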

Prediction and transcript validation of RNA editing sites

The Predictive RNA Editor for Plants (PREP) was used to predict potential RNA editing sites in protein-coding genes with a cutoff value of 0.867. The protein-coding genes were accD, atpA, atpB, atpI, ccsA, clpP, matK, ndhB, ndhD, ndhF, ndhG, petB, petD, petD, petL, psaI, psbB, psbE, psbF, psbL, rpoA, rpoB, rpoC1, rps14, rps2, rps2 and ycf3. Total RNA was isolated from leaves using the Tiangen™ Plant Total RNA Kit (China). The quality and concentration of the RNA samples were examined by agarose gel electrophoresis and spectrophotometry, respectively. First-strand cDNA was prepared from 3 μg of total RNA using the TransScript One-Step gDNA Removal and cDNA Synthesis SuperMix Kit (Transgen Biotech, China). Primer pairs for each gene were designed based on the extracted gene sequences and are listed in Supplemental Table 5. PCR was carried out as follows: 5 min at 95 °C; 30 cycles of 30 s at 95 °C, 30 s at 56–63 °C and 60 s at 72 °C; and 10 min at 72 °C. The PCR products were sequenced at HaiGene (Harbin, China), and the resulting sequences were aligned with the extracted gene sequences. The RNA editing sites validated in this way were collected and compared with the PREP results.
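The comparison of a sequenced cDNA with its genomic coding sequence can be sketched in a few lines. The following minimal Python example is an illustration only and not the procedure used in this study. It assumes the two sequences are already aligned in frame without gaps, and the function name `editing_sites` is hypothetical.

```python
from collections import Counter


def editing_sites(genomic_cds, cdna):
    """List mismatches between a genomic CDS and its cDNA.

    Both sequences are assumed to be in frame, gap-free and of equal length.
    Returns (1-based position, codon position 1-3, change in RNA notation).
    """
    if len(genomic_cds) != len(cdna):
        raise ValueError("sequences must be aligned to the same length")
    sites = []
    for i, (g, c) in enumerate(zip(genomic_cds.upper(), cdna.upper())):
        # Skip matches and ambiguous base calls.
        if g != c and g in "ACGT" and c in "ACGT":
            change = f"{g}-{c}".replace("T", "U")
            sites.append((i + 1, i % 3 + 1, change))
    return sites


def summarize(sites):
    """Count editing events by change type and by codon position."""
    by_change = Counter(change for _, _, change in sites)
    by_position = Counter(pos for _, pos, _ in sites)
    return by_change, by_position
```

For a TCA (S) codon read as TTA (L) in the transcript, the sketch reports a C-U change at the second codon position.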

Thermal denaturation and renaturation of cp genomes

DNA denaturation produces a hyperchromic effect, so the increase in absorbance under gradient thermal treatment reflects the thermal stability of cpDNA. The isolated cpDNA from all species was used. All cpDNA samples were dissolved in Tris-EDTA (TE) buffer, and the concentration was adjusted to 50 ng/mL. Absorbance at 260 nm was used to monitor the denaturation process, and TE buffer was used as a blank control. Each sample was treated in a water bath at a series of temperatures (25 °C, 35 °C, 45 °C, 55 °C, 65 °C, 75 °C, 85 °C, 95 °C) for 10 min at each temperature. The absorbance at the initial temperature was set as A0 (15 °C), and the value at the highest treatment temperature was set as Amax (95 °C). The cpDNA was heated to 95 °C to melt the DNA completely and determine the limit value for each cp genome, providing a standard for each genome. The percentage absorbance increase (AI) due to the hyperchromic effect was calculated as: AI = (Atemp - A0)/(Amax - A0) × 100%. Each measurement was repeated 3 times, and the data were collected for calculation.
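The AI formula can be applied to replicate readings as in the following minimal Python sketch. The function names and the input layout (a mapping from temperature to replicate absorbance readings) are assumptions made for illustration, and averaging the replicates before applying the formula is one possible choice, not necessarily the one used here.

```python
from statistics import mean


def absorbance_increase(a_temp, a0, a_max):
    """AI = (Atemp - A0) / (Amax - A0) x 100, in percent."""
    span = a_max - a0
    if span <= 0:
        raise ValueError("Amax must be greater than A0")
    return (a_temp - a0) / span * 100.0


def ai_profile(readings, a0, a_max):
    """Map each temperature to the AI of its mean replicate absorbance.

    `readings` is assumed to look like {35: [0.51, 0.52, 0.50], ...}.
    """
    return {temp: absorbance_increase(mean(values), a0, a_max)
            for temp, values in sorted(readings.items())}
```

With this normalization every sample reaches 100% at Amax, so curves from different species can be compared on the same scale regardless of their absolute absorbance.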