Chloroplast genomes of Arabidopsis halleri ssp. gemmifera and Arabidopsis lyrata ssp. petraea: Structures and comparative analysis

We investigated the complete chloroplast (cp) genomes of non-model Arabidopsis halleri ssp. gemmifera and Arabidopsis lyrata ssp. petraea using Illumina paired-end sequencing to understand their genetic organization and structure. Detailed bioinformatics analysis revealed genome sizes of both subspecies ranging between 154.4~154.5 kbp, with a large single-copy region (84,197~84,158 bp), a small single-copy region (17,738~17,813 bp) and pair of inverted repeats (IRa/IRb; 26,264~26,259 bp). Both cp genomes encode 130 genes, including 85 protein-coding genes, eight ribosomal RNA genes and 37 transfer RNA genes. Whole cp genome comparison of A. halleri ssp. gemmifera and A. lyrata ssp. petraea, along with ten other Arabidopsis species, showed an overall high degree of sequence similarity, with divergence among some intergenic spacers. The location and distribution of repeat sequences were determined, and sequence divergences of shared genes were calculated among related species. Comparative phylogenetic analysis of the entire genomic data set and 70 shared genes between both cp genomes confirmed the previous phylogeny and generated phylogenetic trees with the same topologies. The sister species of A. halleri ssp. gemmifera is A. umezawana, whereas the closest relative of A. lyrata spp. petraea is A. arenicola.


Results
Chloroplast genome features and the structure of two Arabidopsis subspecies. Two Arabidopsis species, A. halleri ssp. gemmifera and A. lyrata ssp. petraea, were sequenced on an Illumina HiSeq. 2000 to produce 63,528,604 and 67,938,537 bp paired-end raw reads, respectively. The average read length was 101 bp, and the cp genomes of A. halleri ssp. gemmifera and A. lyrata ssp. petraea received 2232.81x and 2173.5x coverage, respectively. The four junction regions for each resulting cp genome were also confirmed by PCR-based Sanger sequencing with four pairs of primers (Table S2). The cp genome sizes of A. halleri ssp. gemmifera and A. lyrata ssp. petraea were 154,473 and 154,489 bp, respectively (Table 1, Fig. 1). The genomes displayed a typical quadripartite structure, as shown by most angiosperms. This included a large single-copy region (LSC; 84197 bp and 84158 bp), a small single-copy region (SSC; 17738 bp and 17813 bp) and a pair of inverted repeats (IRa/IRb; 26264 bp and 26259 bp) in A. halleri ssp. gemmifera and A. lyrata ssp. petraea, respectively. The GC contents (36.4%) of both A. halleri ssp. gemmifera and A. lyrata ssp. petraea were very similar to those in the cp genomes of other Arabidopsis species (Table 1). However, the GC contents were unequally distributed in different fragments of the cp genomes, with the highest values in the IR regions (42.3%), median values in the LSC regions (34.1%) and the lowest values in the SSC regions (29.4%) ( Table 1). The high GC content of the IR regions might be due to the presence of eight ribosomal RNA (rRNA) sequences in these regions.
Repeat Analysis and comparison of its distribution in Arabidopsis subspecies. A total of 71 and 75 repeats were detected in the A. halleri ssp. gemmifera and A. lyrata ssp. petraea genomes, respectively, using REPuter, including direct, reversed and palindromic repeats (Tables S5, S6). In these cp genomes, the repeat analysis detected 11 and 12 palindromic repeats, 26 and 31 forward repeats, and 34 and 32 tandem repeats in A. halleri ssp. gemmifera and A. lyrata ssp. petraea, respectively (Fig. 3). Among these, 22 forward repeats were 30-44 bp in length, 5 tandem repeats were of the same length, and 27 were 15-29 bp in length ( Fig. 3A-D). Similarly, in

Simple Sequence Repeat (SSR) Analysis and comparison.
In this study, we detected perfect SSRs in the cp genomes of A. halleri ssp. gemmifera and A. lyrata ssp. petraea cp together with ten other Arabidopsis species (Fig. 4A-D). Certain parameters were set because SSRs of 10 bp or longer are prone to slipped strand mis-pairing, which is believed to be the main mutational mechanism of SSR polymorphisms. A total of 227 and 229 microsatellites were found in the A. halleri ssp. gemmifera and A. lyrata ssp. petraea cp genomes based on SSR analysis, respectively (Fig. 4A). Among these, 78 and 76 were found in coding regions, while 144 and 148 microsatellites were found in intergenic regions (Fig. 4A)   is the largest of these, and this difference was mostly attributed to variation in the length of the LSC region (Table 1), as reported previously in angiosperms cp genomes. Analysis of genes with known functions showed that A. halleri ssp. gemmifera and A. lyrata ssp. petraea shared 70 protein-coding genes with the other ten Arabidopsis species cp genomes. The number of unique genes found in these cp genomes was 79 (Fig. 5).

Inverted repeat (IR) contraction and expansion across two subspecies. A detailed comparison
was performed of four junctions (J LA , J LB , J SA and J SB ) between the two IRs (IRa and IRb) and the two single-copy regions (LSC and SSC) among A. thaliana, A. arenosa, A. cebennensis, A. pedemontana, A. arenicola, A. croatica,  A. neglecta, A. petrogena, A. suecica and A. umezawana in comparison to A. halleri ssp. gemmifera and A. lyrata ssp. petraea (Fig. 6). We carefully analysed and compared the exact IR border positions and the adjacent genes among the Arabidopsis species cp genomes (Fig. 6). In this study, despite the similar length of the IR regions in A. halleri ssp. gemmifera and A. lyrata ssp. petraea with those in the other ten   Phylogenetic analysis of A. halleri ssp. gemmifera and A. lyrata ssp. petraea. In this study, the phylogenetic position of A. halleri ssp. gemmifera and A. lyrata ssp. petraea within the family Brassicaceae was established by analysing multiple alignments of complete cp genomes and 70 shared genes of 29 Brassicaceae members representing 12 genera (Fig. 7 and Fig. S4). Carica papaya was set as the outgroup. Phylogenetic analyses using Bayesian inference (BI), maximum parsimony (MP), maximum likelihood (ML) and neighbour-joining (NJ) were performed. The results revealed that complete cp genomes and 70 shared genes of A. halleri ssp. gemmifera and A. lyrata ssp. petraea contain the same phylogenetic signals; the complete genome sequence and the 70 shared genes (from all species) generated phylogenetic trees with identical topologies (Fig. 7 and Fig. S4). Maximum likelihood (ML) analysis revealed 22 out of 26 nodes with bootstrap values ≥ 99%, and most of these nodes had 100% bootstrap values. In these phylogenetic trees based on the entire genome data set and the 70 shared genes, A. halleri ssp. gemmifera formed a single clade with A. umezawana, and A. lyrata ssp. petraea formed a single clade with A. arenicola for high Bayesian inference (BI), and bootstrap support using four different methods (Fig. 7 and Fig. S4).

Discussion
This study reports the complete chloroplast genomes of A. halleri ssp. gemmifera and A. lyrata ssp. petraea, ranging from 154.4~154.5 kbp in length. Both cp genomes exhibit a typical quadripartite structure, as reported for other angiosperms. Both genomes encode ~130 genes, including 85 protein-coding genes, 8 ribosomal RNA genes and 37 transfer RNA genes, with 227 to 229 microsatellites distributed randomly throughout their genomes, respectively. In addition, approximately 26/31 forward, 34/32 tandem and 11/12 palindromic repeats were found in both cp genomes. This conformed with the protein coding genes found in other Brassicaceae members, such as A. thaliana 16 , Brassica nigra, Brassica oleracea, 41 and Capsella rubella 42 . Among the coding genes, rps12 is an unequally divided gene, with its 5′ terminal exon located in the LSC region, while two copies of the 3′ terminal exon and intron are located in IRs, which is similar to other angiosperm cp genomes. The pseudogene ycf1 was located in the boundary regions between IRb and SSC, leading to incomplete duplication of the gene within IR regions. We found twelve intron-containing genes in both of the cp genomes, nine of which contained one intron, whereas the genes ycf3, clpP, and rps12 contained two introns. The gene ndhA had the longest intron in both A. halleri ssp. gemmifera (1,089 bp) and A. lyrata ssp. petraea (1,084) (Tables S10 and 11). This intron plays an important role in the regulation of gene expression, and some recent research revealed that various introns improve exogenous gene expression at specific positions. Therefore, the intron can be a valuable tool to improve transformational efficiency 43 . It was observed that ycf1, ycf2 44,45 , rpl23 46 and accD 47,48 are often absent in plants 46 , but they were detected in the reported Arabidopsis cp genomes. The pair of genes atpB-atpE was observed to overlap each other by ~4 bp. However, psbC-psbD had a 92 bp overlap in the A. halleri ssp. gemmifera and A. lyrata ssp. petraea cp genomes, while this overlap was 53 bp in A. thaliana, 17 bp in A. arenosa, 53 bp in Gossypium 49 and 52 bp in Camellia cp genomes 50 . Similar ratios of amino acids were observed in comparison to the previously reported cp genomes [50][51][52] . The preference for a high AT content at the 3 rd codon position is consistent with the A and T concentrations reported in various terrestrial plant cp genomes 17,51,53,54 .
We found 71 and 75 repeats in the cp genomes of A. halleri ssp. gemmifera and A. lyrata ssp. petraea, respectively, which included reversed, direct and palindromic repeats. Repeat sequences are very helpful in phylogenetic studies and play a role in genome rearrangements 53,55 . Furthermore, analyses of the various cp genomes concluded that repeat sequences are essential to inducing indels and substitutions 56 . The length of direct and palindromic repeats in the A. halleri ssp. gemmifera and A. lyrata ssp. petraea cp genomes were much shorter, ranging from 30-101 bp., and similar results have been previously reported in Camellia, which has an 82 bp repeat. However, much longer repeats have been observed, such as the 132 bp and 287 bp repeats found in Poaceae and Fabaceae, respectively 57 . Previous reports suggested that sequence variation and genome rearrangement occur due to slipped strand mispairing and the improper recombination of these repeat sequences 58,59 . Furthermore, the presence of these repeats indicates that the region is a crucial hotspot for genome reconfiguration 59 . Additionally, these repeats are an informative source for developing genetic markers for A. halleri ssp. gemmifera and A. lyrata ssp. petraea for phylogenetic and population studies 55 .
During genome annotation, we noted perfect SSRs in A. halleri ssp. gemmifera and A. lyrata ssp. petraea compared to the ten other Arabidopsis species cp genomes. SSRs usually have a higher mutation rate compared to other neutral DNA regions due to slipped DNA strands. These are present in the cp genome at the highest diversity in copy numbers and are important molecular markers for plant population genetics, evolutionary, and ecological studies 60 . We looked for SSRs of 10 bp or longer, as these have been suggested to be prone to slipped strand mispairing. This is believed to be the main mutational mechanism for SSR polymorphisms 61,62 . From our SSR analysis, 227 and 229 microsatellites were found in the A. halleri ssp. gemmifera and A. lyrata ssp. petraea cp genomes, respectively. In addition, 226, 215, 214, 221, 214, 216, 213, 216, 220, and 216 SSRs were detected in A. thaliana, A. arenosa, A. cebennensis, A. pedemontana, A. arenicola, A. croatica, A. neglecta, A. petrogena, A. suecica  and A. umezawana, respectively. These findings are consistent with the previous observation that the SSRs of cp genomes are dominated by ' A' or 'T' mononucleotide repeats 16,63 . Mononucleotide, pentanucleotide and hexanucleotides repeats were also composed of ' A' or 'T' at greater frequencies, which reflects a biased base composition, with an overall A-T richness in the cp genomes 64,65 . Our findings are comparable to previous reports that showed that SSRs found in the cp genome are generally composed of polythymine (polyT) or polyadenine (polyA) repeats and infrequently contain tandem cytosine (C) and guanine (G) repeats 65 . Therefore, these SSRs contribute to the ' AT' richness of the A. halleri ssp. gemmifera and A. lyrata ssp. petraea cp genomes, as previously reported for different species 52,66 . The current analysis showed that approximately 69% (A. halleri ssp. gemmifera) and 77% (A. lyrata ssp. petraea) of SSRs were detected in non-coding regions. These results agree with previous reports that SSRs are unevenly distributed in cp genomes and might provide more information for selecting effective molecular markers for the detection of intra-and interspecific polymorphisms 67,68 .
Our results reveal that both the A. halleri ssp. gemmifera and A. lyrata ssp. petraea cp genomes share high sequence similarity with all ten Arabidopsis species. However, relatively lower identity was also observed with these species in several comparable genomic regions. In addition, similar to previously reported cp genomes 51,52,[69][70][71] , the LSC and SSC regions were less similar than the two IR regions in all Arabidopsis species cp genomes. Similar results were reported previously in various higher plant cp genomes, which suggests that the lower sequence divergence in the IR regions compared to the SC and LSC regions is possibly due to copy correction between IR sequences by gene conversion 72 . Furthermore, the non-coding regions showed greater divergence than the coding regions. The divergent regions included trnK-rps16, rpoC1, TrnL-TrnF, atpB-rbcL, accD, petA-psbJ, petD-rpoA, ccsA, rpl33, rps12, psbM, ndhD and ycf2. Similar results for these genes were reported previously 51,52 , and these results also confirm similar differences among various coding regions in the analysed species suggested by Yang et al. 73 . These results are consistent with previous reports that these divergent genes are mostly present in LSC regions and show a trend towards more rapid evolution 52 .
Regarding the IR regions, the expansion and contraction at the borders are the main reasons for size variations among cp genomes, and it plays a vital role in evolution 71,74,75 . A detailed comparison between the two IRs and two single-copy regions was performed among A. thaliana, A. arenosa, A. cebennensis, A. pedemontana, A. arenicola, A. croatica, A. neglecta, A. petrogena, A. suecica and A. umezawana in comparison to A. halleri ssp. gemmifera and A. lyrata ssp. petraea. In A. halleri ssp. gemmifera and A. lyrata ssp. petraea, the ycf1 gene is located 1030 and 1028 bp in the IR regions, respectively. This gene is the second largest gene in the plastid genome and encodes a protein of approximately 1,800 amino acids. Recently, researchers reported that ycf1 is essential for plant viability and encodes Tic214, a vital component of the Arabidopsis TIC complex 76,77 .
Recently, Dong et al. found that two regions of the plastid gene ycf1 were highly variable in flowering plants 77 . Specifically, the ycf1 (pseudogene) located in the IRb region is conserved, while the ycf1 in the SSC is highly variable. This region of the ycf1 gene is more variable than matK in most taxa investigated thus far 44,45 . Therefore, researchers used the two regions of ycf1 as a new tool to solve phylogenetic problems at the species level and for DNA barcoding of some closely related flowering plant species [78][79][80][81] . Furthermore, two regions within ycf1, ycf1a and ycf1b have been predicted to have the highest nucleotide diversity (π) at the species level within angiosperm plastid genomes 77,82 .
Chloroplast genomes have shown substantial power in studies of phylogenetics, evolution and molecular systematics. During the last decade, there have been many analyses to address phylogenetic questions at deep nodes based on comparisons of multiple protein-coding genes 83,84 and complete cp genome sequences that enhance our understanding of enigmatic evolutionary relationships among angiosperms 38 . The genus Arabidopsis has been estimated to comprise at least 9 species and 6 subspecies 85 or up to 13 species and 9 subspecies 86 depending on the taxonomic approach and the identifier. According to the Hohmann et al. 38 taxonomy, the genus Arabidopsis has been proposed to consist of as many as 26 taxa, including the model plant A. thaliana 34,86 . Continued efforts have enhanced our ability to differentiate lineages and to understand the genomic structure and phylogenetic relationships of Arabidopsis species 34 . Previous evolutionary relationships among different Arabidopsis genomes and species were estimated using nuclear and chloroplast DNA 87, 88 restriction fragment-length polymorphisms, but complete genome sequencing provides more detailed insights 34 . Recently, Novikova et al. 34 reported Illumina sequencing and phylogenetic analysis based on polymorphism data and chloroplast genome data for 26 taxa, including those presented in the current study, that constitute the genus Arabidopsis 34 . In our study, the phylogenetic positions of A. halleri ssp. gemmifera and A. lyrata ssp. petraea within the family Brassicaceae were established by sequencing complete cp genomes and analysing 70 shared genes of 29 Brassicaceae members, representing 12 genera. The phylogenetic analysis showed that the complete cp genomes and the 70 shared gene data set exhibit identical phylogenetic signals. In both the entire genome data set and the 70 shared genes data set, A. halleri ssp. gemmifera forms a single clade with A. umezawana, and A. lyrata ssp. petraea forms a single clade with A. arenicola. Similar results were described by Hohmann et al. 38 based on trnLF and ribosomal ITS data, finding that A. halleri ssp. gemmifera is a sister to A. umezawana. The results also confirmed a previous report by Hohmann et al. 38 reporting that A. lyrata ssp. petraea and A. arenicola are close relatives and that A. arenicola probably originated postglacially from an A. lyrata population 37 . Furthermore, these results are also in broad agreement with previous results reported by Novikoa et al. 34 , where A. halleri ssp. gemmifera was most closely related to A. umezawana and A. lyrata ssp. petraea formed a clade with A. arenicola 34 .
In conclusion, we assembled and analysed the complete chloroplast genomes of A. halleri ssp. gemmifera and A. lyrata ssp. petraea and compared them with other Arabidopsis species for the first time. The genome organization, gene order, GC content and codon usage were similar to those of previously reported cp genomes from the genus Arabidopsis. The location and distribution of repeat sequences was determined, and sequence divergences of cp genomes and 70 shared genes were calculated with related species. The phylogenetic analysis based on whole cp genomes and 70 shared genes yielded identical phylogenetic trees, with A. halleri ssp. gemmifera and A. lyrata ssp. petraea forming single clades with A. umezawana and A. arenicola, respectively.

Materials and Methods
Genome Sequencing and Assembly. A standard protocol of DNA extraction was followed, as described in detail by Hu et al. 24 . Pure DNA was sequenced on an Illumina HiSeq. 2000. A total of 63,528,604 and 67,938,537 raw reads were generated for A. halleri ssp. gemmifera and A. lyrata ssp. petraea, respectively, which were then trimmed and filtered using CLC Genomics Workbench v7.0 (CLC Bio, Aarhus, Denmark). Then, CLC Genomics Workbench v7.0 (CLC Bio, Aarhus, Denmark) was used for de novo genome assembly. Different k-mer sizes were evaluated, and a k-mer size of 66 provided the best results in terms of minimum members and longest average length of scaffolds. These parameters were used to generate the final assembly. Then, the resulting contigs were compared against the A. thaliana chloroplast genome using BLASTN with an E-value cutoff of 1e-5. Six and seven contigs of A. halleri ssp. gemmifera and A. lyrata ssp. petraea, respectively, were identified and were temporarily arranged based on their mapping positions on the reference genome. Then, primers were designed (Table S1) based on the sequence at the ends of the adjacent contigs. PCR amplification and subsequent DNA sequencing were used to fill the gaps. PCR amplification was performed in a total volume of 20 μl containing 1 × reaction buffer, 0.1 μl Taq DNA Polymerase, 0.4 μl dNTP (10 mM) and 1 μl (10 ng/μl) of DNA. The PCR programme was composed of an initial denaturation at 95 °C for 5 min followed by 32 cycles at 95 °C for 30 s, 60 °C for 20 s and 72 °C for 30 s, with a final extension step at 72 °C for 5 min. After incorporation of the Sanger sequencing results, the completed cp genome was used as a reference to map the initial short reads to refine the assembly based on maximum sequence coverage.
Genome Annotation and Sequence Architecture. A program (DOGMA) was used to annotate the A. halleri ssp. gemmifera and A. lyrata ssp. petraea cp genomes 89 . The annotation results were checked manually, and codon positions were adjusted by comparing to homologues from the database A. thaliana cp genome. All transfer RNAs were verified using tRNAscan-SE version 1.21 90 using the default settings. OGDRAW 91 was used to illustrate structural features of the A. halleri ssp. gemmifera and A. lyrata ssp. petraea cp genomes. The relative synonymous codon usage (RSCU) was determined using MEGA 6.0 92 to examine deviations in synonymous codon usage by avoiding the influence of amino acid composition. The software mVISTA was used in Shuffle-LAGAN mode to compare the whole genome variations of A. halleri ssp. gemmifera and A. lyrata ssp. petraea genome with four other cp genomes using the A. halleri ssp. gemmifera and A. lyrata ssp. petraea annotation as a reference 93 .

Characterization of Repeat sequence and SSRs.
We used REPuter to identify repeat sequences, including palindromic, reverse, and direct repeats, within the cp genome 94 . The following settings for repeat identification were used in REPuter: 1) Hamming distance of 3, 2) 90% or greater sequence identity, and 3) a minimum repeat size of 30 bp. Phobos version 3.3.12 95 was used to detect (SSRs) within the cp genome, with the search parameters set at ≥10 repeat units for mononucleotides, ≥8 repeat units for dinucleotides, ≥4 repeat units for trinucleotides and tetranucleotides, and ≥3 repeat units for pentanucleotide and hexanucleotide SSRs. Tandem repeats in the A. lyrata and A. halleri cp genomes were identified using Tandem Repeats Finder version 4.07 b with default settings 96 . Sequence Divergence and Phylogenetic Analysis. Complete cp genomes and a separate partition containing only the 70 shared genes were used to analyse the average pairwise sequence divergence for ten Arabidopsis species: A. thaliana, A. arenosa, A. cebennensis, A. pedemontana, A. arenicola, A. croatica, A. neglecta, A. petrogena, A. suecica and A. umezawana. Missing and ambiguous gene annotations were confirmed by comparative sequence analysis after a multiple sequence alignment and gene order comparison. These regions were aligned using MAFFT (version 7.222) 97 with default parameters. Kimura's two parameter (K2P) model was selected to calculate pairwise sequence divergences 98 . To resolve the A. halleri ssp. gemmifera and A. lyrata ssp. petraea phylogenetic positions within the family Brassicaceae, twenty-nine published cp genomes were downloaded from the NCBI database for analysis. First, multiple alignments were performed using complete cp genomes based on the conserved structure and gene order of chloroplast genomes 98 . Four methods were employed to construct the phylogenetic trees, including Bayesian inference (BI) implemented with MrBayes 3.1.2 99 , maximum parsimony (MP) with PAUP 4.0 100 , maximum likelihood (ML) and neighbour-joining (NJ) with MEGA 6 92 using settings derived from Wu et al. 101 . MP was run using a heuristic search with 1000 random addition sequence replicates with the tree-bisection-reconnection (TBR) branch-swapping tree search criterion. Parameters for the ML analysis were optimized with a BIONJ tree as the starting tree with 1000 bootstrap replicates using the Kimura 2-parameter model with gamma-distributed rate heterogeneity and invariant sites. For Bayesian posterior probabilities (PP) in the BI analyses, the best substitution model GTR + G model was tested according to the Akaike information criterion (AIC) by jModelTest verion 2 102 . The Markov Chain Monto Carlo (MCMC) was run for 1,000,000 generations with 4 incrementally heated chains, starting from random trees and sampling 1 out of every 100 generations. The first 25% of trees were discarded as burn-in to estimate the value of posterior probabilities. In the second phylogenetic analysis, 70 shared genes from the cp genomes of the twenty-two Brassicaceae members, with Carica papaya as the outgroup species, were aligned in ClustalX using the default settings, followed by manual adjustment to preserve reading frames. The above four phylogenetic-inference methods were used to infer trees from these 65 concatenated genes using the same settings described above and in Yao et al. 103 .