Complete chloroplast genome sequence of Dryopteris fragrans (L.) Schott and the repeat structures against the thermal environment

Dryopteris fragrans (L.) Schott is a fern growing on the surface of hot rocks and lava. It is exposed to sunlight directly and bears local hot environment. We sequenced the complete nucleotide sequence of its chloroplast (cp) genome. The cp genome was 151,978 bp in length, consisting of a large single-copy region (85,332 bp), a small single-copy region (31,947 bp) and a pair of inverted repeats (17,314 bp). The cp genome contained 112 genes and 345 RNA editing sites in protein-coding genes. Simple sequence repeats (SSRs) and long repeat structure pairs (30–55 bp) were identified. The number and percent of repeat structures are extremely high in ferns. Thermal denaturation experiments showed its cp genome to have numerous, dispersed and high GC percent repeat structures, which conferred the strongest thermal stability. This repeat-heavy genome may provide the molecular basis of how D. fragrans cp survives its hot environment.


Results
Chloroplast genome assembly and validation. Quantitative real-time polymerase chain reaction (qRT-PCR) result showed that the rbcL was detected in isolated cpDNA samples, while actin 6, the nuclear specific gene, was not detected (Supplemental Fig. 1). It showed that the cpDNA samples were pure. The sequencing run generated 2,740,440 raw reads, totaling 822,132,000 bases, with an average read length of 300 bp from the D. fragrans cp genome. A total of 2,262,910 clean reads with an average read length of 184.56 bp were de novo assembled into 31 contigs. The average sequence coverage depth of each nucleotide on the cp genome was 105X . A maximum scaffold size of 143,707 bp that spanned most of the small and large single copy region (SSC and LSC) and the entire inverted repeat (IR) region was generated. Because the IR region had double the coverage compared with the remaining scaffold, it was used twice in the complete cpDNA sequence. We submitted the annotated cp genome sequence of D. fragrans under GenBank accession number KX418656.2.
Chloroplast genome features and comparison. The cp genome of D. fragrans was 151,978 bp in length, with a typical quadripartite structure (Fig. 1). It included a pair of IRA and IRB of 17,314 bp separated by one SSC and one LSC of 31,947 bp and 85,332 bp, respectively. The D. fragrans cp genome contained 112 genes (Table 1), including 4 ribosomal RNA genes; 26 transfer RNA genes, and 22 genes encoding ribosomal subunits, of which 12 encoded the small subunit and 10 for the large subunit. It also included 3 genes encoding DNA-directed RNA polymerases, and 44 genes dedicated to photosynthesis, of which 11 encoded subunits of the NAD(P)H-quinone oxidoreductase, 4 encoded the photosystem I complex, 13 encoded the photosystem II complex, 6 encoded the cytochrome b/f complex, 6 encoded different subunits of the ATP synthase and 1 encoded the large chain of the ribulose bisphosphate carboxylase (rbcL). Three genes encoded the dark-operative protochlorophyllide oxidoreductase subunits; 5 genes (ycf1, 2, 3,4,12) were dedicated to open reading frames; 2 were detected to protease; 1 encoded a translational initiation factor and 5 for other proteins. Among them, 14 genes contained introns, including psaA, atpF, ndhA, ndhB, ndhE, ndhF, ndhG, rpl2, rpl20, rpoB, rpoC1, cemA, clpP and ycf3 (Table 1). Compared with Adiantum capillus-veneris, Pteridium subsp. aquilinum and Cyrtomium devexiscapulae, the D. fragrans cp genome gained orf42, trnR-ACG, rrn5, rrn4.5, rrn23, trnA-UGC and trnN-GUU in SSC. In addition, trnR-ACG, rrn5, rrn4.5, rps12 and trnI-GUG were lost in IRs, and ndhB was truncated in IRA. The psbK and trnG-UCC were lost in IRB-LSC (Fig. 2). The GC content of the cp genome was 43.15%. The IR region was the highest (44.18%), followed by the LSC (42.70%) and the SSC (43.26%). The GC contents in rRNA genes (55.75%) and tRNA genes (55.46%) were higher than those in protein coding regions (44.06%). The comparison of cp genome size, GC content, gene number and order is listed in Table 2. The D. fragrans cp genome did not contain tRNA for the amino acid codons Lys. The codon usage frequency was listed in Table 3. Among these codons, 987 (3.57%) encoding for glutamate and 166 (0.60%) for cysteine, were the most and the least amino acids codons, respectively.
SSRs and repeat structures in the D. fragrans chloroplast genome. The MISA detected 44 SSRs in the D. fragrans cp genome (Supplemental Table 2), including 41 homopolymers and 3 dipolymers. Tetrapolymers, pentapolymers, and hexapolymers were not found. Sixteen SSRs were exclusively composed of A or T bases, 27 SSRs were G or C bases, and 1 was an AG base. Most of the bases were G or C bases, except for the AG dipolymer. All of these SSRs were located in the IGS, and none were located in protein-coding genes. Repeat analysis by REPuter, with the criterion of a copy size of ≥30 bp or longer and a sequence identity ≥90%, identified 80 forward, 1 reverse and 23 palindromic repeat structure pairs from 30 to 55 bp. Repeat lengths of 30 to 32 bp were most common (27.40%). A total of 53 repeat pairs were found in the coding regions, of which 6 were located in the transfer RNA genes. The remaining 151 repeat pairs were located in the IGSs (Supplemental Table 3). In addition, one of the longest repeat structure (55 bp) overlapped with the longest SSR sequence (18 bp G mononucleotide sequence) (Supplemental Table 3 RNA editing. The transcripts obtained by PCR were used to verify the RNA editing sites predicted by PREP. The PREP prediction results showed that there were 438 RNA editing sites in protein-encoding genes, corresponding to 338 codon changes. All editing events were of the C-to-U variety. Among them, 96 non-synonymous mutations were found at the first position of the codon, while the remaining mutations were found at the second position, and none were found at the third position. However, in the transcript validation results, there were 345 RNA editing sites in the D. fragrans cp genome (Supplemental Table 4). In all, 88 mutations occurred in the first position of the codon, 208 at the second position, and 49 at the third position. The C-U mutations were the most common, reaching 305 (88.41%) mutations, followed by U-C 13 (3.77%), G-A 8 (2.32%), A-G 7 (2.03%), A-C 3 (0.87%), C-G 2 (0.58%), G-U 2 (0.58%), U-A 2 (0.58%), U-G 2 (0.58%) and G-C 1 (0.29%) (Supplemental Fig. 2). There were 318 codon changes, including 33 synonymous mutations and 285 nonsynonymous mutations. The majority of editing sites were predicted in the ndhF gene (130629-123964, 41 editing sites), followed by the atpB gene (72943-71462, 21 editing sites). The conversions of amino acids included 119 hydrophilic to hydrophobic changes (H to Y, S to L, S to F, and T to M) and 105 hydrophobic to hydrophobic changes (L to F, A to V, P to S, R to W and P to L). The codons that turned into leucine (Leu) were the most common, accounting for 125 changes Comparison of cpDNA thermal stability. To confirm that repeat structures with high GC content contributed enhanced the cp genome stability of D. fragrans, the thermal denaturation for all species cp genomes were completed. In the denaturation experiment, the absorbance of all samples increased with elevated temperatures (Fig. 4)

Discussion
DOGMA is the most popular software for cp genome annotation and is used widely [27][28][29][30] . This software can detect protein-coding genes, rRNA, and tRNA quickly. However, it also has drawbacks in detecting genes, because its ability to detect introns is not very sensitive. In our work, DOGMA annotated 110 genes but did not detect genes with intron(s). We performed another software analysis using MAKER-P. It detected 9 genes (ndhF, rpl21, rps6, cemA, ccsA, lhbA, matK, ycf1 and ycf12) that were not annotated by DOGMA. It also identified 14 genes containing intron(s), and their positions were also corrected ( Table 1). The genes in the fern cp genomes are different from those of seed plants, although both of them are vascular plants. This made DOGMA perform not very well in ferns and produced incorrect result. The MAKER-P showed an advantage in the detection of protein-coding genes and introns. Thus, annotation for fern cp genomes requires the use of different software programs. The typical size of fern cp genome is 131 to 168 kb 31,32 , and the D. fragrans cp genome is within this range. The gene number and order are largely similar to the cp genomes of ferns, but there are some differences among species. The genome size variation is mostly due to length variation in the IR and the SSC regions 28 . Some IR expansions/contractions are observed within species. Compared to other ferns, the IR regions of the D. fragrans cp genome lost a 4033 bp sequence, including trnR-ACG, rrn5, rrn4.5, rrn23, trnA-UGC and trnN-GUU. This sequence was located in IRs of other fern cp genomes. However, this sequence in the D. fragrans cp genome was moved into SSC. Thus, these genes did not exist as two copies in the D. fragrans cp genome. It is possible that the fern reduced the expression of these genes. The phenomenon causes the D. fragrans cp genome to contain the longest SSC and shorter IRs ( Table 2). Alhough synteny and inversions are important, the structure of the D. fragrans cp genome does not show obvious changes. It is consistent with results of Xiang et al. 9 . Furthermore, overlapped genes are not notable (Fig. 1), which would reduce the cp genome usage. Thus, the D. fragrans cp genome has more intergenic sequences, leading to a more dispersed gene distribution and increasing the sequence length. These findings suggest that the fern cp genome chooses reduces the overlapping genes but extends the intergenic sequences. Thus, genes are more independent and sequence utilization is more specific.
RNA editing is an important post-transcriptional process in cps and is thought to be functionally significant 33 . In our work, the phenomenon was obvious in protein-coding genes. There were great disparities between the results of PREP and transcript validation. The number of RNA editing sites and codon transformations in the PREP results were far more than those of the validation. This result indicates that there may be some deficiencies in PREP, though the prediction results conformed to the number and quantity of predicted variations in general. This may be because the PREP database is not abundant enough, especially for ferns. Moreover, the result shows that transcript verification is necessary for RNA editing site prediction. In seed plant cps, a conversion from C-to-U is the most predominant form 34 , and reverse U-to-C editing is the opposite in seed plants 35 . Most editing events in the D. fragrans cp genome were C-to-U (88.4%) events. At the same time, its C-to-U transition is the most frequent type of base change. It has been reported that the excess of C-to-U RNA editing developed in early stages of vascular plant evolution 36 . Our results support this view. On the other hand, the number and percent of codons transitioning to Leu were the highest in the D. fragrans cp genome. It is similar with those of Adiantum capillus-veneris, though the genetic distance between A. capillus-veneris and D. fragrans is long in the fern clade. Leu biosynthesis occurs in cp and plays an important role in photosynthesis-related metabolism 37,38 . Both species account for a heavily used Leu codon, suggesting they have a great need for Leu. Their level of RNA editing is more than ten times that of any other vascular plant examined across an entire cp genome 39 . This reflects the fact that RNA editing occurs in different fern species and may play a major role in fern cp and cp genome processing. Simple sequence repeats (SSRs) ranging in length from 1-6 or or more base pairs, also known as microsatellites and short tandem repeats (STRs), are important genetic molecular markers for population genetics 40,41 and are widely used for plant genotyping 42,43 . In this work, there were 44 SSRs in the D. fragrans cp genome. The number of GC SSRs was more than the number of AT SSRs. This finding contrasts with the view that cp SSRs are generally composed of short polyadenine (poly A) or polythymine (poly T) repeats and rarely contain tandem guanine (G) or cytosine (C) repeats 44 . On the other hand, the number and percent of repeat structure (30-55 bp) in the D. fragrans cp genome were far more than other species (Fig. 3). This is the first time that a fern species has been shown to contain a considerable number of repeat structures. It shows that its cp genome is rich in repeat structures. At the same time, most repeat structures were located in every IGS dispersedly, and the GC percentages of most repeat structures were higher than the average value (43.04%). This indicates that the dispersed repeat structures probably play a key role in maintaining cp genome stability. D. fragrans may increase the IGS number and length of inserted repeat structures with a high active GC content. Previous has work suggested Wudalianchi was formed in great volcanic eruption. Its physiognomy is mainly consist of alkaline basalt 47 . This is a kind of black volcanic rock with low specific heat capacity (0.84 kJ/(kg·K)) and small thermal conductivity. The basalt would absorb lots of heat quickly under direct and long sunshine in summer. It could result in high surface temperature easily and form a local hot environment in the range of basalt geomorphology. The temperature of basalt surface in summer often reaches 70 °C. On the other hand, the basalt topography cools quickly at night. D. fragrans grows on the exposed basalt surface and is exposed to large temperature fluctuations between day and night. Most ferns cannot endure such high temperature and dramatic temperature changes, but D. fragrans is a rare fern and is highly resistant to heat. In the results mentioned above, our study revealed this fern possesses the greatest number of repeat structures, with a high GC percentage, among all ferns studied. The three hydrogen bonds between GC are stronger than the two between AT, such that the GC percentage determines the strength of the DNA double chain (i.e., loose or tight). Some researchers have noted that the higher the level of GC content, the more stable the structure of the genome DNA 48 . We speculate that these repeat structures with high GC content may allow the fern to cope with heat and large temperature differences. Thus, we performed a heat denaturation experiment to compare the cpDNA thermal stability of ferns species and closely related species from different habitats and families, including Nephrolepidaceae, Thelypteridaceae, Pteridaceae, Davalliaceae, Aspleniaceae, Polypodiaceae, Dryopteridaceae, Dennstaedtiaceae, Parkeriaceae and Isoetaceae. Arabidopsis and wheat showed the most obvious changes in the denaturation experiment. This indicates that their cpDNA is very sensitive to heat. S. sessilifolia, T. palustris, C. fortunei, I. sinensis and C. thalictroides changed earlier and largely under 45 °C. The habitats of these five species are swamps or humid underforest environments. Their heat resistance was weak. D. chingia, N. biserrata, P. scolopendrium, M. strigosa and P. amoena showed smaller variations and similar heat stability between 35-45 °C. However, most of them could not survive at temperatures over 45 °C and began to rise significantly with the increase of temperature (Fig. 4). D. chingia, P. amoena and M. strigosa are saxicolous ferns in forest. Their cpDNA showed heat resistance and their thermal stability was limited. In addition, there were great differences between D. fragrans and C. fortunei in terms of their thermal stability, although both of them belong to the Dryopteridaceae family. This suggests that great differences exist within species of the same family, which is caused by different environments. These results support the speculation that a considerable number of dispersed repeat structures with a high GC content (43.04%) enhance D. fragrans cpDNA thermal stability and maintain its structure in the face of thermal changes. It is one of molecular basis of D. fragrans in response to severe environments. It also provided a new scope for understanding the environmental adaption mechanisms of plants.   .) and Ceratopteris thalictroides were collected in our lab. The cp isolation methods were modified based on previous methods [49][50][51] . Five grams of complete leaves from all species were picked and rinsed. The leaves were then crushed in liquid nitrogen and added to a separation solution (0.33 M D-Sorbitol, 50 mM Tris-HCl pH 7.6, 5 mM MgCl 2 , 10 mM NaCl, 2 mM EDTA, 2 mM D-sodium erythorbat and 0.2% beta-mercaptol) for grinding. The cell suspensions were filtered, and the filtrate was centrifuged at 1000 rpm for 10 min to eliminate large-sized cell fragments. The supernatant was collected and centrifuged at 4000 rpm for 10 min to separate the cp. We obtained cps and extracted pure cpDNA using the CTAB-based method 52 . Every DNA sample was treated with RNase. To assess the contamination of the nuclear genome, nuclear special gene, actin6, and cp special gene, rbcL, were selected to performed qRT-PCR. The specific primer pairs or degenerate primer pairs were designed based on special sequence or homologous sequences (Supplemental Table 1     Chloroplast genome annotation and comparative analyses. Gene location and annotations of the D. fragrans cp were performed using the Dual Organellar GenoMe Annotator (DOGMA) (http://dogma.ccbb. utexas.edu) 57 and MAKER-P 58,59 , including protein-coding and rRNA and tRNA genes. All genes, rRNAs, and tRNAs were identified using the plastid/bacterial genetic code. The predicted annotations were verified using Chloroplast Genome DB (http://chloroplast.cbio.psu.edu/) 60 63 . The annotated sequence was submitted to GenBank. The circular gene map of the D. fragrans cp was generated using OGDRAW 64  Examination of the repeat sequences and RNA editing. MISA, a microsatellite identification tool (http://pgrc.ipk-gatersleben.de/misa/misa.html), was used to detect SSRs 65 , with thresholds of mononucleotide repeats ≥10 bases, dinucleotide repeats ≥6 bases, tri-and tetranucleotide repeats ≥5 bases, and hexanucleotide or greater repeats ≥5 bases. The max distance between two SSRs was 100 base pairs. Based on these analyses, we identified the location of the SSRs. The REPuter program 66 was used to assess long repeat sequences on the forward, reverse and palindrome sequences within the cp genomes. The sizes of the repeats were set at a repeat minimal length of ≥30 bp and a maximal length of ≤55 bp with a Hamming distance of 3. Furthermore, we selected 29  Prediction and Transcript validation of RNA editing sites. The predictive RNA Editor for Plants (PREP) was used to predict potential RNA editing sites in protein-coding genes with a cutoff value of 0.8 67 . The protein-coding genes were accD, atpA, atpB, atpI, ccsA, clpP, matK, ndhB, ndhD, ndhF, ndhG, petB, petD, petD, petL, psaI, psbB, psbE, psbF, psbL, rpoA, rpoB, rpoC1, rps14, rps2, rps2 and ycf3. Total RNA was isolated from leaves using Tiangen ™ Plant Total RNA Kit (China). The quality and concentration of RNA samples were examined by agarose gel electrophoresis and spectrophotometer analysis, respectively. The first-strand cDNA was prepared with 3 μg of total RNA using the TransScript One-Step gDNA Removal and cDNA Synthesis SuperMix Kit (Transgen Biotech, China). Primer pairs for each gene were designed based on extracted gene sequences and are listed in Supplemental Table 5. The PCR was carried out as follows: 5 min at 95 °C, 30 cycles of 30 s at 95 °C, 30 s at 56-63 °C, 60 s at 72 °C and 10 min at 72 °C. The PCR products were sequenced at HaiGene (Harbin, China). The sequences were aligned with extracted gene sequences. The RNA editing sites validated by PCR were collected and compared to the PREP results.
Thermal denaturation and renaturation of cp genomes. DNA denaturation produces hyperchromic effect. The isolated cpDNA from all species was used. For denaturation, the absorbance increase could reflect the thermal stability of cpDNA by gradient thermal treatment. All cpDNA from species were dissolved in Tris-EDTA (TE) buffer. The concentration was adjusted to 50 ng/mL. Absorbance at 260 nm was used to monitor the denaturation processes, and the TE buffer was used as a blank control. Each sample was treated in different temperatures water bath (25 °C, 35 °C, 45 °C, 55 °C, 65 °C, 75 °C, 85 °C, 95 °C), and each temperature treatment went for 10 min. The absorbance of the initial temperature treatment was set as A 0 (15 °C), and the value of highest temperature treatment was set at A max (95 °C). The cp DNA was heated to 95 °C in order to melt the DNA completely and determine limit values for cp genomes, such that the standards for each cp genome can be provided. The formula to determine the increase percentages of the absorbance increase (AI) of the hyperchromic effects as: AI = (A temp -A 0 )/(A max -A 0 ) × 100%. The recorded data were repeated 3 times and collected to calculate.