Introduction

The genus Cerastium, belonging to the family Caryophyllaceae, contains over 200 species1 and consists of herbaceous annuals and perennials2,3 that occur mainly in the Northern Hemisphere. The genus is most common in temperate and cold regions, especially at high elevations, with Eurasia serving as its center of diversity. Most representatives of the genus have a limited range, and only a few species have a cosmopolitan distribution3. Current knowledge of the genetic diversity of Cerastium rests on a rather limited number of studies. Genetic investigations have employed analysis of isoenzymatic polymorphism4,5 and various molecular markers such as RAPD and SCAR6, AFLP7,8 and iPBS9. Apart from traditional genetic diversity studies, several papers have focused on the role of hybridization and introgression events in the evolution of the genus10,11,12. One example of the intricate systematics within Cerastium is the C. alpinum–C. arcticum complex. Extensive morphological variation within that complex has resulted in the description of numerous species, subspecies, and varieties13,14,15,16, among them C. alpinum L., C. arcticum Lange and C. nigrescens (H.C. Watson)17,18,19. C. alpinum is an arctic-alpine species that occurs in the northern part of North America and Europe. Moreover, it is recorded in Europe on high mountain grasslands, mostly in the subalpine zone (from 1480 to 1680 m a.s.l.), where it forms single-species aggregations20,21. C. arcticum is the most problematic component of the species group. Recent genetic and morphological analyses suggest that the species conventionally known as C. arcticum actually consists of two separate taxa: C. arcticum s. str. and C. nigrescens10,18,19. The former is restricted to arctic areas (the Canadian Arctic, Greenland, Svalbard, north-western arctic Russian islands), while the latter is characteristic of fell regions (the British Isles, Fennoscandian mountains, Faeroe Islands, Iceland). Despite intensive study, delimitation of these taxa remains problematic on a large geographic scale11. Consequently, a novel approach is needed to find a universal marker for taxon identification.

Owing to recent progress in molecular sciences, high-throughput genome sequencing technologies have become widely available and provide a relatively fast and inexpensive way of obtaining high-quality genomic data. In plant genetics, chloroplast (cp) genomes have become a source of data commonly used in comparative studies22,23, biotechnology24, species identification25,26 and analyses addressing phylogenetic questions27,28. It has been shown that the complete cp genome contains a roughly equivalent amount of information to the cox1 gene used in animals, so it has the potential to provide enough distinguishing differences to enable molecular identification of even closely related species29. Using the entire chloroplast genome as a super-barcode is a novel approach that could potentially address the limitations of conventional two-locus barcoding30. Traditional barcoding primarily relies on sequence variation within two regions of the chloroplast genome, matK and rbcL, which is not always sufficient for precise species delimitation. To date, there are only two publicly available chloroplast genome sequences for the genus Cerastium, i.e. a complete cp genome sequence for C. glomeratum and a partial genome sequence for C. arvense (NC_066897 and MH627219, respectively; NCBI). The available data revealed that the Cerastium chloroplast genome has a conserved quadripartite structure with size and gene content typical for angiosperms. Apart from the above-mentioned C. glomeratum, there are 60 other species (representing the following genera: Agrostemma, Arenaria, Colobanthus, Dianthus, Gymnocarpos, Gypsophila, Lychnis, Myosoton, Paronychia, Psammosilene, Pseudostellaria, Silene, and Stellaria) for which complete plastome sequences are available in the NCBI database (accessed on March 24, 2023). Given that the Caryophyllaceae family consists of 40 genera and includes about 12,500 species, the number of chloroplast genomes currently available for this group of plants should be regarded as very low.

The complete chloroplast genomes of three Cerastium species (C. alpinum, C. arcticum and C. nigrescens) have been sequenced and annotated for the first time in this paper. The specific objectives of this study included: (1) determination of the size and structure of cp genomes for C. alpinum, C. arcticum and C. nigrescens, (2) identification of genomic repeats, including forward, reverse, palindromic and complementary sequences among Cerastium chloroplast genomes, (3) identification and characterization of simple sequence repeats (SSRs) in newly sequenced Cerastium plastomes, (4) analysis of the evolution and dynamics of chloroplast protein-coding sequences, (5) comparative study of all available Cerastium chloroplast genomes, and (6) reconstruction of the phylogenetic relationships within genus Cerastium and family Caryophyllaceae based on plastome sequences.

Results

Organization of chloroplast genomes

The Illumina NovaSeq platform was used for chloroplast genome sequencing of three Cerastium species. The highest number of raw reads (15,370,470) was obtained for C. arcticum, whereas sequencing of C. nigrescens and C. alpinum yielded 13,970,724 and 13,586,446 reads, respectively. The raw reads were then mapped separately to the reference chloroplast genome of C. glomeratum. As a result, 297,869 mapped reads with a mean coverage of 304× were obtained for C. nigrescens, while for the two other species these values were more than twice as high: 650,075 reads and 664× coverage for C. alpinum, and 664,454 reads and 675× coverage for C. arcticum (Supplementary Figure S1). The size of the reported cp genomes was 147,945 bp for C. alpinum, 147,940 bp for C. nigrescens and 148,722 bp for C. arcticum. Each chloroplast genome was a circular, double-stranded DNA molecule with the typical quadripartite structure, composed of a Large Single Copy (LSC) and a Small Single Copy (SSC) region separated by a pair of Inverted Repeat (IR) regions, which have identical sequences but opposite orientations (Fig. 1). The overall GC content was nearly identical in all Cerastium species: 36.51% for C. alpinum, 36.46% for C. arcticum and 36.52% for C. nigrescens (Table 1). Additionally, variant calling analysis revealed no heteroplasmy in the reported chloroplast genomes.
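
As an illustration of how a GC content value such as those reported above can be derived from an assembled plastome sequence, a minimal Python sketch follows. The function name gc_content is hypothetical and is not part of the pipeline used in this study; the sketch assumes that ambiguous symbols (e.g. N) are excluded from the denominator.

```python
def gc_content(seq: str) -> float:
    """Return the GC percentage of a nucleotide sequence.

    Assumption: only unambiguous A/C/G/T bases are counted; any other
    symbol (N, gaps, IUPAC codes) is ignored.
    """
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / counted
```

Applied to a complete plastome read from a FASTA file, the returned value would be directly comparable with the percentages listed in Table 1.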

Figure 1
Gene map of the three Cerastium chloroplast genomes. Genes drawn inside the circle are transcribed clockwise, and those outside are transcribed counterclockwise (indicated by arrows). Different functional gene groups are color-coded. GC content variation is shown in the middle circle.

Table 1 Summary of chloroplast genome characteristics of studied Cerastium species.

All three reported Cerastium chloroplast genomes contained an identical set of 113 genes composed of 75 protein-coding genes, 30 transfer RNA genes, four ribosomal RNA genes, and four conserved chloroplast ORFs (ycf1, ycf2, ycf3, ycf4) (Table 2). In each IR region we also identified a sequence of the rpl23 gene which, owing to an internal premature termination codon, was retained as a nonfunctional pseudogene. Most protein-coding genes have the standard AUG initiation codon. The total number of codons for all protein-coding genes in the reported cp genomes was 26,115 for C. arcticum, 26,116 for C. alpinum and 26,220 for C. nigrescens. All studied species shared a similar pattern of codon usage and amino acid frequency. Leucine was the most frequent amino acid (10.7%), whereas cysteine was the least frequently encountered (1.2%). The most abundant codon (4.46%) was ATT and the least abundant (0.004%) were TTG (all species), ATA and CTG (C. nigrescens). The CTG codon appeared only in C. nigrescens (Supplementary Table S1). Most of the genes in the analyzed chloroplast genomes did not contain introns; 14 contained one intron (atpF, ndhA, ndhB, petB, petD, rpl16, rpoC1, rps16, trnA-UGC, trnG-UCC, trnI-GAU, trnK-UUU, trnL-UAA, trnV-UAC), whereas only three genes consisted of three exons (clpP, ycf3, and rps12). Our data confirmed that the rps12 gene, coding plastid ribosomal protein S12, is a trans-spliced gene. This gene was split into three exons: the first exon (5’ end of the sequence) was located in the LSC, while the second and third exons were located in the IRs. The smallest intron was found in trnL-UAA (518 bp for C. arcticum and 520 bp for C. alpinum and C. nigrescens), whereas the largest was in the trnK-UUU gene (2479 bp for C. arcticum and 2480 bp for C. alpinum and C. nigrescens). The matK gene was positioned inside the intron of trnK-UUU. Fifty-eight protein-coding genes, 22 tRNA genes, and two conserved chloroplast ORFs (ycf3 and ycf4) were located in the LSC region; the SSC region contained eleven protein-coding genes, one tRNA gene, and one chloroplast ORF (ycf1, located on the boundary between SSC and IRB); and the repeated IR region contained six protein-coding genes (including the rps19 gene located on the boundary between IRA and LSC), seven tRNA genes, four rRNA genes, and one chloroplast ORF (ycf2).

Table 2 List of genes present in chloroplast genome of Cerastium.

The boundaries between the IR and LSC/SSC regions were identified (Fig. 2). In the plastomes of C. alpinum, C. arcticum, and C. nigrescens the complete sequence of the ycf1 gene was located on the boundary between IRA and SSC, and its incomplete copy on the IRB/SSC boundary, where it functions as a pseudogene (Ψycf1). Ψycf1 overlapped (by 89 bp) with the ndhF gene. The IRA/SSC boundary was located within the ycf1 sequence, 1822 bp from its 5’ end. The IRB/LSC boundary was found within the rps19 gene (52–54 bp from its 3’ end, depending on the species). Its shorter copy was located at the IRA/LSC boundary, where it acts as a pseudogene (Ψrps19). Ψrps19 overlapped (by 19 bp) with the trnH gene. The trnH gene was near the IRA/LSC border (11 bp apart in C. alpinum and C. nigrescens and 9 bp in C. arcticum). The localization of the IR and LSC/SSC boundaries was also analyzed for C. glomeratum. In this species the analyzed boundaries were identified within the same genetic elements. The IRB/LSC boundary was located within the rps19 gene (65 bp from its 3’ end) and the pseudogene Ψrps19 was found at the IRA/LSC boundary. However, Ψrps19 did not overlap with trnH. Analysis of the IRA/SSC and IRB/SSC boundaries revealed the inversion of the entire SSC region in the C. glomeratum cp genome. The IRB/SSC border was located within the ycf1 gene (1867 bp from its 5’ end), whereas the IRA/SSC border was within the ndhF gene (57 bp from its 3’ end). The sequence for Ψycf1 was not annotated in the analyzed plastome. Finally, the trnH gene was located 25 bp from the IRA/LSC border.

Figure 2
Comparison of LSC, SSC, and IR boundaries of four Cerastium chloroplast genomes.

Repetitive sequences and SSRs

The analysis of genomic repeats in the cp genomes of the studied Cerastium species (C. alpinum, C. arcticum, C. nigrescens, and C. glomeratum) revealed 79 repetitive sequences with lengths ranging from 30 to 170 bp (Supplementary Table S2A–D). The number of repeats was the highest (23) in C. arcticum and the lowest (16) in C. glomeratum. Palindromic repeats dominated among the identified sequences (from 47.8% in C. arcticum to 62.5% in C. glomeratum), followed by forward repeats (from 31.3% in C. glomeratum to 47.8% in C. arcticum) and reverse repeats (from 4.3% in C. arcticum to 6.3% in C. glomeratum) (Fig. 3b). No complementary repeats were found in the analyzed chloroplast genomes. Most repeat sequences (80%) were found in the LSC region, and the remaining repeats were equally distributed (10% each) between the IR and SSC regions (Fig. 3c). Repeats with a length of 30–40 bp were the most frequent in each species (from 13 in C. glomeratum to 18 in C. alpinum, C. arcticum, and C. nigrescens) (Fig. 3a).
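
The four repeat classes mentioned above differ only in how the second copy relates to the first. The following Python sketch illustrates the usual definitions for a pair of exactly matching copies; the function classify_repeat is a hypothetical helper written for this illustration and does not reproduce the repeat-finding software used in the study, which may also tolerate mismatches.

```python
from typing import Optional

# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def classify_repeat(first: str, second: str) -> Optional[str]:
    """Classify the relation between two repeat copies (exact matches only).

    forward:     second copy is identical to the first
    reverse:     second copy is the first read backwards
    complement:  second copy is the base-wise complement of the first
    palindromic: second copy is the reverse complement of the first
    Returns None when none of the four relations holds. If several
    relations hold at once, the first one in the order above is returned.
    """
    first, second = first.upper(), second.upper()
    complement = first.translate(_COMPLEMENT)
    if second == first:
        return "forward"
    if second == first[::-1]:
        return "reverse"
    if second == complement:
        return "complement"
    if second == complement[::-1]:
        return "palindromic"
    return None
```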

Figure 3
Number of repeat types and their distribution in four Cerastium species. (a) Length of the repeats; (b) types of repeats; (c) location of repeat sequences. F, P, R represent forward, palindromic and reverse repeats.

Application of the Phobos software revealed from 20 (C. nigrescens) to 23 (C. arcticum) chloroplast microsatellites (Fig. 4a), including mono-, di-, tri-, tetra-, penta- and hexanucleotide SSRs (Fig. 4b, Supplementary Table S3A–D). The mononucleotide SSRs, all composed of A/T repeat units, were the most common in each species with a frequency ranging from 36.4% (C. glomeratum) to 43.5% (C. arcticum). The second most common motif among identified SSRs was AAT/TAA with a frequency ranging from 27.3% (C. glomeratum) to 35% (C. nigrescens). Tetranucleotide SSRs, with frequency ranging from 19% (C. alpinum) to 31.8% (C. glomeratum) were composed of AAAT/TAAA, AATT/TTAA, ACCT/TCCA, AGAT/TAGA, and AAAG/GAAA motifs. Among identified chloroplast microsatellites there was only one SSR that contained a dinucleotide motif (AT/TA, C. glomeratum), one SSR with pentanucleotide motif (AATAT/TATAA, C. alpinum), and one SSR built of hexanucleotide motif (AAATCC/CCTAAA, C. arcticum). A substantial number of SSRs were identified in the LSC region (from 72.7% in C. glomeratum to 76.2% in C. alpinum), followed by SSC (from 14.3% in C. alpinum to 18.2% in C. glomeratum) and IR regions (from 8.7% in C. arcticum to 10% in C. nigrescens) (Fig. 4c). SSRs were mainly located within intergenic spacers (from 55% in C. nigrescens to 61.9% in C. alpinum), whereas the remaining microsatellites were distributed within exons (from 31.8% in C. glomeratum to 35% in C. nigrescens) and introns (from 4.8% in C. alpinum to 10% in C. nigrescens) (Fig. 4d).
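
To make the notion of mono- to hexanucleotide SSRs concrete, the sketch below scans a sequence for perfect tandem repeats with a regular expression. It is a simplified illustration and not a reimplementation of Phobos: the function find_ssrs and the minimum length threshold of 10 bp are assumptions made for this example, and imperfect repeats are not detected.

```python
import re


def find_ssrs(seq, min_length=10, max_unit=6):
    """Return a sorted list of (start, unit, copies) for perfect SSRs.

    min_length (total repeat length in bp) is a hypothetical threshold,
    not a setting taken from the study.
    """
    seq = seq.upper()
    hits = []
    for k in range(1, max_unit + 1):
        # A unit of length k followed by at least one further copy of itself.
        pattern = re.compile(r"([ACGT]{%d})\1+" % k)
        for match in pattern.finditer(seq):
            unit = match.group(1)
            # Skip units that are themselves built from a shorter motif
            # (e.g. "AA" or "ATAT"); they are reported under that motif.
            if unit in (unit + unit)[1:-1]:
                continue
            if len(match.group(0)) >= min_length:
                hits.append((match.start(), unit, len(match.group(0)) // k))
    return sorted(hits)
```

For example, a sequence containing a run of twelve A residues and four copies of AAT yields one mononucleotide and one trinucleotide SSR; shorter runs fall below the assumed threshold.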

Figure 4
The distribution and type of simple sequence repeats (SSRs) in cp genomes of four Cerastium species. (a) Number of different SSRs types; (b) distribution of SSR motifs in different repeat class types; (c) location of different SSRs in IR, SSC and LSC regions; (d) partition of SSRs among IGS, introns and exons.

Synonymous (Ks) and non-synonymous (Ka) substitution rate analysis

The substitution rate varied across genes in each functional group and ranged from 0 to 0.151 and from 0 to 0.0858 for Ka and Ks, respectively (Supplementary Table S4). The highest average value of Ka (0.0062) was noted in the group of “other genes” and the lowest (0.0012 and 0.0014) in genes related to the cytochrome b/f complex and photosystem II, respectively. The highest average value of Ks (0.0247) was noted in the gene for the RubisCO large subunit, and the lowest in genes associated with the small ribosomal subunit (0.0159) and subunits of ATP synthase (0.0160). In summary, no differences (Ka = 0 and Ks = 0) were observed in the sequences of 11 genes, whereas only synonymous substitutions (Ka = 0) were observed in 18 genes. The Ka/Ks ratio was less than 1 in all genes except ndhB (2.7250 for C. arvense). Relatively high values of Ka/Ks were observed in rpl22 (0.8673) for all studied species and in rps14 (0.8776) for C. arvense. In the remaining cases, the values did not exceed 0.75 (Fig. 5).

Figure 5
Circular visualization of the plastome comprehensive analyses of three Cerastium species (C. nigrescens, C. arcticum, and C. alpinum). The first outer track represents the chloroplast gene symbols. The second line track (A) shows haplotype diversity (π) values calculated for sliding window equal to 800 bp. The red part of the line plot depicts regions with the highest diversity (π > 0.015). Histograms (B) show comparative Ka/Ks ratio values for Cerastium species, where blue, red, green and black colors depict the dominant Ka/Ks values in C. arcticum, C. nigrescens, equal for C. alpinum and C. nigrescens, and equal for all three species, respectively. Both scatter plots show the number of potential C > U and U > C editing sites within each plastid gene (C,D, respectively). The colors describe higher numbers of RNA editing sites in C. arcticum (blue points) and C. nigrescens and C. alpinum (green points) in comparison to other compared species.

Genomic comparative and nucleotide diversity analyses

The MAUVE results revealed a highly conserved structure of the chloroplast genomes of C. alpinum, C. arcticum, C. nigrescens, and C. arvense, for which no rearrangements (inversions or translocations) were detected. Only in C. glomeratum was the opposite orientation of the whole SSC region observed (Supplementary Fig. S2).

Nucleotide diversity (π) in the analyzed cp genomes of Cerastium species was estimated at 0.00493. The results of sliding window analysis showed that the π value for the studied Cerastium cp genomes varied from 0 to 0.02708 (Fig. 5). Nine highly variable (π > 0.015) regions were identified in the analyzed cp genomes: rpl32–trnL-UAG, ndhA (intron), rps16 (intron), trnD-GUC–trnY-GUA, trnF-GAA–ndhJ, ndhC–trnV-UAC, petA–psbJ, psbE–petL, and trnP-UGG–psaJ. The highest π value (0.02708) was observed for the trnD-GUC–trnY-GUA region. All of these divergent hotspots were identified in non-coding regions, i.e. intergenic spacers and introns. Furthermore, the majority of highly variable regions (7) were identified in the LSC, followed by two such regions in the SSC, and none in the IR region (Fig. 5).
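
The sliding window analysis described above can be sketched as follows for a set of aligned sequences. This is a minimal illustration rather than the software used in the study: the function names are hypothetical, the 800 bp window follows the caption of Fig. 5, the step of 200 bp is an assumption, and alignment columns containing gaps or ambiguous bases are simply excluded.

```python
from itertools import combinations


def nucleotide_diversity(seqs):
    """Average number of pairwise differences per site (pi).

    Assumes equal-length, upper-case aligned sequences. Columns holding
    anything other than A/C/G/T in any sequence are skipped.
    """
    columns = [col for col in zip(*seqs) if all(b in "ACGT" for b in col)]
    if len(seqs) < 2 or not columns:
        return 0.0
    n_pairs = len(seqs) * (len(seqs) - 1) // 2
    diffs = sum(
        sum(1 for x, y in combinations(col, 2) if x != y) for col in columns
    )
    return diffs / (n_pairs * len(columns))


def sliding_pi(seqs, window=800, step=200):
    """Return (window_start, pi) pairs along the alignment.

    step=200 is an assumed value; only full-length windows are reported.
    """
    length = len(seqs[0])
    last_start = max(length - window, 0)
    return [
        (start, nucleotide_diversity([s[start:start + window] for s in seqs]))
        for start in range(0, last_start + 1, step)
    ]
```

Windows whose π exceeds a chosen cutoff (0.015 in this study) would then be flagged as candidate divergence hotspots.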

Prediction of RNA editing sites

Prediction of RNA editing sites with the PREPACT 3.0 tool revealed from 578 to 588 editing sites in 63 protein-coding genes (Fig. 5, Supplementary Table S5A–D). The lowest number of predicted RNA editing sites (578) was found for C. alpinum and C. nigrescens, whereas the highest (588) was found for C. glomeratum. For C. arcticum the number of RNA editing sites was 583. Both C to U and U to C conversions were found among the identified editing events. In 14 genes no such changes were identified. The C to U conversion accounted for 43.05% to 43.54% of total RNA editing sites, while U to C substitutions were responsible for 56.46% to 56.95% of the identified editing events. All predicted RNA editing sites resulted in non-synonymous mutations. About 47% (47.17–47.28%) of the substitutions were found at the first position of the codon and about 53% (52.72–52.83%) at the second position, whereas none were found at the third position. Among the predicted RNA editing events there were also conversions that involved two editing sites within one codon. Eighteen such editing events were identified in C. alpinum and C. nigrescens, and 20 in C. arcticum and C. glomeratum. Most of these events involved conversions of UCU and UCC codons for serine (S) into CUU and CUC triplets for leucine (L) and back from leucine to serine, and also conversion of UUU and UUC for phenylalanine (F) to CCU and CCC for proline (P), and in the opposite direction, i.e., from proline to phenylalanine. The highest numbers of predicted RNA editing sites were reported for the ycf1 (85–88), ycf2 (77), and rpoC2 (64–65) genes. The most frequent substitution in each species was the phenylalanine (F) to leucine (L) change (16.48–16.75%), whereas proline (P) to phenylalanine (F) and arginine (R) to tryptophan (W) changes were observed with the lowest frequencies (0.353–0.358% and 0.881–0.896%, respectively). Additionally, RNA editing was predicted to convert the termination codon UAA into the CAA triplet encoding glutamine in the ndhI gene of C. arcticum.
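
The relation between the codon position of an editing site and its effect on the encoded amino acid can be illustrated with a short Python sketch based on the standard genetic code. The function editing_effect is a hypothetical helper and is unrelated to the internals of PREPACT; it only translates a codon before and after a single-base change.

```python
# Standard genetic code, codons ordered with bases T, C, A, G.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: _AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(_BASES)
    for j, b in enumerate(_BASES)
    for k, c in enumerate(_BASES)
}


def editing_effect(codon, position, new_base):
    """Return (amino acid before, amino acid after, is_non_synonymous).

    position is 1-based within the codon; RNA (U) or DNA (T) notation
    is accepted. "*" denotes a termination codon.
    """
    codon = codon.upper().replace("U", "T")
    new_base = new_base.upper().replace("U", "T")
    edited = codon[:position - 1] + new_base + codon[position:]
    before, after = CODON_TABLE[codon], CODON_TABLE[edited]
    return before, after, before != after
```

In the standard code, NNU and NNC codons always encode the same amino acid, so a C to U or U to C change at the third position is synonymous; this is consistent with all predicted non-synonymous sites falling at the first or second codon position.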

Additionally, we conducted the same investigation for the chloroplast genes of C. arvense. Owing to the incomplete sequences available for rpl20, rpoB, rpoC1, rpoC2, ycf1, and ycf2, these genes were not included in the analysis. In 17 out of 71 analyzed genes, we did not identify potential RNA editing sites. In the remaining 54 genes we found 286 editing sites (Fig. 5, Supplementary Table S5E). In this species, too, both C to U and U to C conversions were found, but U to C editing dominated (56.46%). Substitutions were observed at the first (53.06%) and the second (46.94%) position of the codon, whereas they were absent from the third position. As described above for C. alpinum, C. arcticum, C. nigrescens, and C. glomeratum, the predicted RNA editing events in C. arvense also included conversions that involved two editing sites within one codon. There were seven cases in which CUU and CUA codons for leucine (L) were changed into UCU and UCA for serine (S), and back from serine to leucine. The highest numbers of predicted RNA editing sites were identified within the sequences of the matK (40) and ndhF (37) genes. All the identified RNA editing events caused non-synonymous mutations. The change from phenylalanine (F) to leucine (L) was the most abundant substitution (18.53%), whereas leucine (L) to proline (P) and arginine (R) to cysteine (C) changes were observed with the lowest frequency (0.7%).

Phylogenetic analysis

In the BI tree, a very high Bayesian posterior probability value (≥ 0.92) was reached in 96.4% of the nodes (53 out of 55). The reconstructed tree supported the taxonomic position of the studied species and revealed the following relationships: all Silene species together with Lychnis wilfordii formed one clade, which gathered only representatives of the Sileneae tribe; a second clade was formed by representatives of the Caryophylleae tribe, i.e., five Dianthus species, three representatives of the genus Gypsophila and Psammosilene tunicoides; a third clade consisted of all representatives of the Alsineae tribe (eight Pseudostellaria species, Stellaria dichotoma var. lanceolata, Myosoton aquaticum and all studied Cerastium species, which formed one subgroup); a fourth clade consisted of eight Colobanthus species (Sagineae tribe), whereas Spergula arvensis (Sperguleae) and Paronychia argentea with Gymnocarpos przewalskii (Paronychieae) formed two separate branches. The most divergent position on the tree was occupied by A. thaliana, which was used here as an outgroup (Fig. 6).

Figure 6
Phylogenetic tree (cladogram) based on sequences of 71 shared protein-coding genes from five Cerastium species and 54 other Caryophyllaceae representatives, inferred using Bayesian posterior probabilities (PP). Bayesian PP are given at each node.

Results of divergence time estimation suggested that the family Caryophyllaceae started to diversify ca. 74.46 million years ago (Mya). Subsequent radiation within the family followed: ca. 51.47 Mya the Sperguleae tribe split from the other sister clades, and at ca. 48.32 Mya diversification of the Sagineae tribe (represented here only by the genus Colobanthus) was observed. At ca. 46.26 Mya the evolutionary paths of the Alsineae and Arenarieae tribes diverged from those of the Caryophylleae and Sileneae tribes. At ca. 42.26 Mya Alsineae and Arenarieae split apart, and ca. 41.93 Mya Caryophylleae split from Sileneae. Diversification events at lower taxonomic levels, e.g. within the tribes Caryophylleae, Sileneae and Alsineae, started at 30.87, 26.0 and 20.6 Mya, respectively. The genus Cerastium began to diversify at 3.66 Mya (Fig. 7).

Figure 7
Divergence time estimation of selected Caryophyllaceae taxa. The numbers next to the nodes represent the divergence time (Mya, million years ago).

Discussion

Chloroplast genomes are a relevant resource for many genomic and biotechnological applications31. Their unique features, such as the lack of recombination and a slower mutation rate in comparison to nuclear genomes, make the chloroplast genome a frequently used source of data in evolutionary biology32. Moreover, the chloroplast genome is commonly used in phylogeographic studies because of its uniparental inheritance, which exhibits geographical structure33.

Although the genus Cerastium consists of more than 200 species1, genomic data for this group of plants are very limited and, to date, there is only one complete chloroplast genome sequence in the NCBI database, that of C. glomeratum. A chloroplast genome sequence is also available for another Cerastium species (C. arvense), but several gaps in the intergenic spacers and the lack of complete sequences for six protein-coding genes (rpl20, rpoB, rpoC1, rpoC2, ycf1, and ycf2) constrain its utility. To fill this gap in the knowledge of Cerastium genomics, we sequenced and annotated the plastid genomes of three species: C. alpinum, C. arcticum, and C. nigrescens. The size of the reported cp genomes ranged from 147,940 bp (C. nigrescens) to 148,722 bp (C. arcticum) and was similar to that of the plastome of C. glomeratum (148,643 bp) and other angiosperms34. All three studied cp genomes share the same gene content and order and the typical quadripartite structure, with a pair of inverted repeats (IR) separated by an SSC and an LSC region. Length variation in cp genomes in different groups of plants is often caused by expansions and contractions of IR regions35. In extreme cases, the IR regions have been completely lost from the chloroplast genomes of some algae36, or one IR copy is absent in some representatives of leguminous plants37. Consequently, analysis of the distribution of IR/LSC and IR/SSC borders has become a standard element of plastome characterization. Such analyses have revealed that border locations may differ among species, even between closely related genera38. Analysis of the chloroplast genomes of C. alpinum, C. arcticum, and C. nigrescens reported here revealed that the IR/LSC and IR/SSC boundaries were located within the sequences of the ycf1 and rps19 genes (Fig. 2), which is analogous to the situation observed for most angiosperms39. The location of the IR boundaries was identical for C. alpinum and C. nigrescens, whereas a minor shift (two bases within rps19 and three bases within the ycf1 gene) was observed for C. arcticum. The lengths of the IR and SSC regions in the reported plastomes were very similar and ranged from 25,507 to 25,513 bp and from 16,850 to 16,861 bp, respectively. Higher variation was found for the LSC region, where the difference between the longest and the shortest LSC was 782 bp (C. arcticum vs. C. nigrescens). Nevertheless, the sizes of all three plastome regions are consistent with previous reports for other dicotyledons40,41. For comparative purposes, the IR borders within the chloroplast genome of C. glomeratum were also examined. In this case, more differences were observed. Although the IR borders were also located within the rps19 and ycf1 genes, an 11-base shift for rps19 and a 45-base shift for ycf1 were found. Additionally, only one copy of ycf1 can be found within the C. glomeratum plastome, at the IRB/SSC border, as its incomplete copy (Ψycf1) at the IRA/SSC border was not annotated. However, the main difference is associated with the opposite orientation of the whole SSC region. This interesting phenomenon was originally reported for Phaseolus vulgaris42. Using restriction enzyme analysis, the author revealed that the chloroplast DNA of individual plants demonstrates a type of heteroplasmy in which the plastome occurs in two equimolar states (i.e., inversion isomers) that differ in the orientation of the SSC region. Later this phenomenon was confirmed in other species, e.g., Heterosigma akashiwo43, Lasthenia burkei44, and Artemisia frigida45.

The chloroplast genomes of C. alpinum, C. arcticum, and C. nigrescens contained an identical set of 113 genes, which was also identical to that of C. arvense. In the cp genome of C. glomeratum, the psbL gene initially appeared to be missing, but reannotation of the plastome allowed us to identify the psbL sequence between the psbJ and psbF genes. Furthermore, two additional genes, i.e., infA (coding translation initiation factor I) and rpl23 (encoding ribosomal protein L23), were not annotated in the C. glomeratum plastome. Detailed analysis of the chloroplast genome of this species enabled identification of these sequences, but their pseudogenization (i.e., the presence of internal, premature termination codons) was the most probable reason why they were not annotated by the original authors of the sequence. In C. alpinum, C. arcticum, and C. nigrescens rpl23 was also identified as a pseudogene, whereas a complete sequence of the infA gene was found and annotated. Loss of the infA gene has also been observed in other species within the Caryophyllales46. In some cases, the infA gene was found to be a pseudogene, i.a. in Nicotiana tabacum47, Arabidopsis thaliana48, Oenothera elata41 and several Allium species49. In the chloroplast genome of another Caryophyllaceae representative, i.e. Dianthus superbus var. longicalycinus, both infA and rpl23 were retained as pseudogenes50. Pseudogenization of the rpl23 gene has also been reported previously in various species, i.a. within the genera Triticum51, Hordeum52 and Secale53. The studied Cerastium cp genomes had a GC content of 36.46–36.52%, which is comparable with other Caryophyllaceae: 36.32% in Dianthus caryophyllus54, 36.4% in Silene jenisseensis55, 36.5% in Pseudostellaria palibiniana56, P. okamotoi57, P. heterophylla58, P. longipedicellata59 and Gymnocarpos przewalskii60, and 36.7% in Colobanthus quitensis61.

The repeat regions of the genomes are of particular importance in sequence rearrangement and recombination62. The genomic repeats identified within the chloroplast genomes of C. alpinum, C. arcticum, C. nigrescens, and C. glomeratum ranged from 30 to 170 bp in length and were identified predominantly (56.3–69.6%) within non-coding regions. Similar values were reported in other Caryophyllaceae, such as C. quitensis (53.3%)63, Silene capitata (56.0%) and Lychnis wilfordii (69.2%)64. The majority of the repeats (78–90%) in all four Cerastium genomes are between 30 and 40 bp in length. Similar values were reported in other angiosperms, namely legumes (Glycine, Lotus, Medicago65) and cotton (Gossypium hirsutum66).

Chloroplast simple sequence repeats, or microsatellites, are repetitive genomic elements that typically consist of tandemly repeated multiple copies of mono- to hexanucleotide motifs which are usually found in the non-coding regions67. Due to their high abundance, random distribution within the genome and high polymorphism information content, they are also widely used for high-throughput genotyping68. These markers proved their usefulness in population genetics and evolutionary studies69,70. In the analyzed plastomes of four Cerastium species, the mononucleotide (A/T) repeats were the most abundant SSR motif (36.4–43.5%). The dominance of mononucleotide chloroplast SSRs has been also observed in other Caryophyllaceae, where it ranged from 44.8% in Colobanthus apetalus63 or 55.3% in C. lycopodioides71 up to 76.8% in Silene capitata or 77.6% in Lychnis wilfordii64. In turn, di- (AT/TA), penta- (AATAT/TATAA) and hexanucleotide (AAATCC/CCTAAA) microsatellites were least abundant, and only one such element was identified in C. glomeratum, C. alpinum, and C. arcticum, respectively.

The synonymous (Ks) and non-synonymous (Ka) substitution rate and their ratio (Ka/Ks) are important parameters in gene evolution studies72. Generally, in most of the coding regions synonymous nucleotide substitutions dominate over non-synonymous changes73. This was also observed in our study, where Ks values dominated over Ka which resulted in high sequence conservation. Nevertheless, there were also sequences for which considerable variation was found due to the high Ka values. The highest Ka values were observed for rpl32 (average Ka = 0.0151) and matK (average Ka = 0.0134). High variation of the matK sequence has been widely documented and it is recognized as one of the most promising barcoding sites for systematic and evolutionary studies in plants74,75. There are also studies reporting high genetic diversity in the immediate vicinity of the rpl32 gene (ndhF–rpl32 or rpl32–trnL)76,77 and the role of rpl32 gene in the evolution of chloroplast genomes which involve its complete loss, substitution or transfer to the nucleus (for review see78). Assessment of the ratio of nonsynonymous (Ka) to synonymous (Ks) substitution is widely accepted approach used to infer about the direction of the sequence evolution at the protein level (Ka/Ks > 1 indicates a positive selection, Ka/Ks < 1 indicates a negative or purifying selection, whereas Ka/Ks = 1 indicates a neutral evolution)79,80. Protein functions are maintained through purifying selection, whereas positive selection favors new gene variants which may be beneficial for organism adapting to changing environmental conditions. In the case of our study, Ka/Ks ratio of all genes was less than 1, except for ndhB (2.7250 for C. arvense), implying that this gene evolved at a faster rate and underwent positive selection. 
The same pattern of selection (Ka/Ks > 1.0) for the ndhB gene was also reported in various species representing the families Gentianaceae (Gentiana lawrencei81), Orchidaceae (Calanthe delavayi82) and Cupressaceae (Cupressus and Juniperus species83). The group of ndh genes, encoding subunits of NADH dehydrogenase, plays a key role in the use of light energy and the electron transfer chain to produce ATP, an essential component for photosynthesis84. Chloroplast NADH dehydrogenase is sensitive to strong light stress and can protect plants from photoinhibition or photooxidation stress by stabilizing the NADH complex and preventing drought-related declines in photosynthetic rate and growth delay85. These observations may suggest that NADH dehydrogenase genes are involved in adaptation to environmental stresses through optimization of photosynthesis. An excess of functionally adaptive amino acid substitutions within NADH dehydrogenase genes was described previously for Poaceae86. The authors observed signals of positive selection acting on one-third of all chloroplast protein-coding genes (25 out of 76), including nine of the eleven genes encoding subunits of NADH dehydrogenase. In our study, the signal of positive selection detected for the ndhB gene in C. arvense might be interpreted as one of the mechanisms of physical adaptation which enabled this cosmopolitan species to colonize vast areas of Europe and North America.

Highly variable sequences found within chloroplast genomes are a common source of molecular markers suitable for phylogenetic analyses and species identification87. Although traditional chloroplast barcoding regions, such as matK, rbcL or the intergenic spacer trnH-psbA, revealed lower than expected genetic diversity, our genome-wide comparative analysis of plastomes of four Cerastium species (C. alpinum, C. arcticum, C. nigrescens, and C. glomeratum) allowed us to identify nine fast-evolving regions. Among these divergent hotspots (π > 0.015) there were seven regions (trnD-GUC–trnY-GUA, trnF-GAA–ndhJ, ndhC–trnV-UAC, petA–psbJ, psbE–petL, trnP-UGG–psaJ and the intron within the rps16 sequence) located within the LSC region and two others (rpl32–trnL-UAG and the intron within the ndhA sequence) identified within the SSC region. To the best of our knowledge, none of these chloroplast genome regions has been used to date for phylogeny reconstruction within the genus Cerastium. Nevertheless, there are several phylogenetic studies performed within various groups of plant species, including the family Caryophyllaceae, in which at least some of the regions listed above were used, e.g. the intron of rps1688, petA–psbJ89 or rpl32–trnL-UAG90.

RNA editing is one of the most important post-transcriptional modifications and occurs mainly in mitochondrial and chloroplast transcripts91,92. RNA editing is described as a process involved in the correction of missense mutations of genes at the RNA level. This mechanism can alter the nucleotide sequence through insertion, deletion, or substitution of nucleotides93,94 to preserve the function of encoded proteins95. The first report of RNA editing was documented for the cox2 gene in the protozoan parasite Trypanosoma brucei96, whereas in plants RNA editing was first discovered in the sequence of cox2 of Triticum aestivum97 and then in rpl2 in maize98. Several editing sites have been reported in many other species, i.a. A. thaliana93, N. tabacum99, Oryza sativa100, Pisum sativum101 and Manihot esculenta102. RNA editing that converts cytidine into uridine (C into U) is widespread in plant organelles and occurs mostly at the first or second positions of codons103, whereas the reverse U to C conversion is more restricted in occurrence. In the studied Cerastium species, both C to U and U to C editing were predicted. RNA editing by U to C is rather rare in terrestrial plants, but it has been found in some species, i.a. A. thaliana104, hornworts105, lycophytes106 and ferns107.

One of the plant groups that has been intensively studied in terms of its phylogeny is the family Caryophyllaceae. Traditionally, Caryophyllaceae was divided into three subfamilies: Alsinoideae, Caryophylloideae, and Paronychioideae108. However, the traditional taxonomy of the family encountered many difficulties, i.e., most of the genera appeared to be polyphyletic, probably because many of the morphological characters evolved in parallel109. More recently, a new classification of the family Caryophyllaceae based on three chloroplast regions (matK, trnL-trnF, and rps16) was proposed, which divided this group into 11 tribes110. Unfortunately, only two Cerastium species (C. arvense and C. fontanum) were represented in this study and, based on their molecular characteristics, they were nested within the Alsineae tribe, together with representatives of the following genera: Stellaria, Pseudostellaria, Myosoton, Plettkea, Holosteum, Moenchia, and Lepyrodictis. Cerastium is one of the Caryophyllaceae genera whose structure is still intensively debated. Even determining the number of species distinguished within this group of plants is problematic, and estimates vary from 60111 or 1003,112 up to 200113 species. Phylogenetic analyses employing multiple nuclear and plastid DNA sequences have established the monophyly of Cerastium13,114. However, there are still some issues associated with Cerastium systematics that need clarification, for example, the status of the C. alpinum–C. arcticum complex, which includes C. alpinum, C. arcticum, and C. nigrescens. Several evolutionary lineages were identified within that complex in earlier research based on morphology, isozymes, and DNA markers6,10,19. It was reported that the origin and evolution of these taxa are most likely related to the fluctuations of ice sheet range during the Quaternary glaciations, which caused extensive migrations of the species and enabled multiple hybridization and introgression events11,19,115.
This hypothesis is consistent with the results of studies reporting no variation in chloroplast trnL-trnF and psbA-trnH sequences among representatives of the arctic-alpine C. alpinum–C. arcticum complex and members of the boreal-temperate C. tomentosum and C. arvense groups13.

In our study, phylogenetic analysis was based on 71 concatenated protein-coding gene sequences. The revealed phylogenetic relationships between the analyzed representatives of the family Caryophyllaceae were in concordance with the taxonomic position of the studied species and previous phylogenies of this group109,116. Moreover, the obtained results allowed us to unambiguously discriminate all analyzed species, including five representatives of the genus Cerastium (C. alpinum, C. arcticum, C. nigrescens, C. glomeratum, and C. arvense). This is in agreement with the previous observation that a phylogenetic network that combines several genes is preferable to a single-gene tree, as the latter is typically insufficient to reveal reliable phylogenetic relationships117. All Cerastium species were gathered in one clade, but C. glomeratum appeared to be the most divergent from the other species.

Our divergence time analysis confirmed the results of previous studies on the molecular and temporal diversification of the family Caryophyllaceae. Analogous to the results of the latest research based on the nuclear ITS region and four plastid sequences (matK, rbcL, rps16 and trnL-F)118, our study suggested that the family Caryophyllaceae began to diversify before the end of the Cretaceous (ca. 74.46 Mya) and that this process continued through the Paleogene and Neogene, with the highest intensity of diversification in the last 10 Mya119. According to our observations, the Alsineae tribe, which includes the genus Cerastium, started to diversify at 20.6 Mya, whereas the beginning of that process for the genus Cerastium was dated to ca. 3.66 Mya. Our results suggested that C. glomeratum split earliest from the other representatives of this genus, whereas the other species appeared to be at an early stage of diversification. The high similarity of the studied Cerastium plastome sequences may be treated as possible evidence for weak barriers to breeding between these species, which enabled spontaneous hybridization between them. Previously, interspecific hybridization events were reported for many Cerastium species8,120. Although a close relationship between C. nigrescens and C. arcticum was previously reported10,18,19, our study suggested a more divergent character of the more geographically distant C. arcticum and closer genetic relationships between C. nigrescens and C. alpinum. These observations, together with the results of previous studies pointing to possible hybridization between these two sympatric species (C. nigrescens and C. alpinum)11, showed the complexity of evolution, which can take place across a broad range of scenarios and spatial circumstances121.

C. arvense was unexpectedly grouped with species from the C. alpinum–C. arcticum complex. This is probably because the publicly available partial sequence of the C. arvense chloroplast genome that we used in phylogenetic studies lacked complete sequences of six genes (rpl20, rpoB, rpoC1, rpoC2, ycf1, and ycf2), thus the phylogeny reconstruction was performed on a limited number (71) of plastid genes. The absence of these genes might then be responsible for the underestimation of genetic divergence between C. arvense and other Cerastium species. The application of a complete chloroplast genome sequence appeared here as the alternative method for distinguishing the true phylogenetic relationships between these closely related taxa. This approach has already proved its usefulness for taxa with a relatively short time since the divergence event or a low rate of evolution resulting in low sequence variation31,122. Nevertheless, in both cases resequencing of C. arvense is required.

Although complete chloroplast genomes of three Cerastium species (C. alpinum, C. arcticum and C. nigrescens) were reported and characterized here for the first time, further research is required to investigate and finally resolve the taxonomic issues associated with the genus Cerastium and the C. alpinum–C. arcticum complex. Subsequent studies should include not only analyses of chloroplast genomes but also nuclear regions, because when hybridization and polyploidy are common the resolution that chloroplast genome sequences can provide for phylogenomics research may be limited123. Nevertheless, our results proved the suitability of chloroplast genome sequences as reliable and effective DNA barcodes for Cerastium species.

Conclusion

The chloroplast genomes of Cerastium alpinum, C. arcticum, and C. nigrescens were sequenced and characterized for the first time. The reported chloroplast genomes appeared to be highly conserved in terms of gene content and order as well as their quadripartite structure. Highly divergent regions (rpl32–trnL-UAG, ndhA intron, rps16 intron, trnD-GUC–trnY-GUA, trnF-GAA–ndhJ, ndhC–trnV-UAC, petA–psbJ, psbE–petL and trnP-UGG–psaJ) and microsatellite sequences that could potentially be used as markers in genetic diversity or phylogenetic studies were identified. Reconstruction of phylogenetic relationships within the family Caryophyllaceae confirmed the previously reported systematic relations within that group of plants and supported the position of Cerastium species as a separate clade within the tribe Alsineae. Although the obtained data provide insight into the evolution and biogeographic history of the genus Cerastium, further studies are needed to finally elucidate the relationships between species from the C. alpinum–C. arcticum complex.

Methods

Plant material, DNA extraction and chloroplast genome sequencing

The research material consisted of three Cerastium species: C. alpinum, C. arcticum, and C. nigrescens. Fresh leaves of C. alpinum and C. arcticum were harvested from plants grown from seeds in a greenhouse (Department of Plant Physiology, Genetics and Biotechnology, University of Warmia and Mazury in Olsztyn, Poland). The seeds of C. alpinum were collected in 2020 in Babia Góra National Park (Poland) after obtaining permission from the Polish Ministry of Environment. In the case of C. arcticum, seeds were collected by Michał Węgrzyn from the Institute of Botany of Jagiellonian University in Kraków, Poland, during the Arctic expedition to Nicolaus Copernicus University Polar Station in Spitsbergen in 2012. In turn, C. nigrescens individuals were collected by Keith W. Larson from Climate Impacts Research Centre, Umeå University, Sweden, in Nuolja massif (Sweden) and delivered to Olsztyn in dried form. The species identification included analysis of both vegetative and generative organs. In the case of C. alpinum and C. arcticum, identification was performed by Irena Giełwanowska, whereas C. nigrescens status was verified by Keith W. Larson. Voucher specimens of each studied species have been deposited in the Vascular Plants Herbarium of the Department of Botany and Nature Protection at the University of Warmia and Mazury in Olsztyn, Poland (OLS), under the following numbers: C. alpinum (No. OLS 33837), C. arcticum (No. OLS 33840) and C. nigrescens (No. OLS 33841). Photographs of representatives of each studied species are provided as supplementary material: C. alpinum (Supplementary Fig. S3), C. arcticum (Supplementary Fig. S4) and C. nigrescens (Supplementary Fig. S5).

Total genomic DNA was extracted from the fresh or dried material of a single plant using the Maxwell 16 LEV Plant DNA Kit (Promega, Madison, WI). The amount and purity of the isolated DNA were estimated spectrophotometrically (NanoDrop ND-1000 UV/Vis; NanoDrop Technology). Additionally, the quality of DNA was verified on a 1.5% (w/v) agarose gel in the presence of 0.5 µg/ml ethidium bromide (wavelength 300 nm; Ultra-Lum EB-20 Electronic UV Transilluminator).

The appropriate genome libraries (library kit: TruSeq DNA PCR Free (350)), prepared from high-quality genomic DNA, were sequenced on the Illumina NovaSeq6000 platform (Illumina Inc., San Diego, CA, USA) with 150 bp paired-end reads.

Annotation and genome analysis

The quality of raw reads was checked with the FastQC tool. Raw reads were trimmed (5 bp of each read end, regions with more than 5% probability of error per base) and mapped to the reference chloroplast genome of C. glomeratum (NC_066897) using Geneious v.R7 software124 with medium–low sensitivity settings. The details on subsequent procedures for chloroplast genome assembly and annotation were described in our previous study78. The chloroplast genomes were annotated using PlasMapper125 with manual adjustment and circular maps of chloroplast genomes were drawn using the OrganellarGenome DRAW tool126. Each chloroplast genome assembly was validated using GetOrganelle v.1.7.7.0127.

Additionally, to check for the possible presence of heteroplasmy, variant calling analysis was performed in Geneious software using the “Find Variations/SNPs (Single Nucleotide Polymorphism)” feature with the following parameters: minimum variant frequency = 0.1, minimum coverage = 10, p-value cut-off = 0.0001 and default values for the remaining parameters.
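The frequency and coverage thresholds listed above can be illustrated with a minimal Python sketch. It is not the Geneious algorithm: the input structure (`pileup`, a mapping of position to per-base read counts) and the function name are hypothetical, and the p-value filter is deliberately not reproduced.

```python
def candidate_heteroplasmic_sites(pileup, min_freq=0.1, min_cov=10):
    """Flag positions where a non-major base reaches min_freq at coverage >= min_cov.

    pileup: dict mapping position -> dict of base -> read count
            (hypothetical input format, assumed for illustration only).
    Returns a list of (position, major_base, minor_base, minor_frequency).
    """
    sites = []
    for pos in sorted(pileup):
        counts = pileup[pos]
        coverage = sum(counts.values())
        if coverage < min_cov:
            continue  # too few reads to call a variant at this position
        major = max(counts, key=counts.get)
        for base, n in sorted(counts.items()):
            if base != major and n / coverage >= min_freq:
                sites.append((pos, major, base, round(n / coverage, 3)))
    return sites


# Illustrative (made-up) read counts, not data from this study.
example = {
    100: {"A": 95, "G": 5},   # minor allele at 5%: below the frequency threshold
    200: {"C": 80, "T": 20},  # minor allele at 20%: reported
    300: {"G": 5, "A": 3},    # coverage 8: below the coverage threshold
}
print(candidate_heteroplasmic_sites(example))
```

With these thresholds only position 200 is reported; a real analysis would additionally require the statistical support that the p-value cut-off provides.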

Genomic repeats and SSR analysis

The genomic repeats, including forward, reverse, palindromic and complementary sequences, were identified using REPuter software128 with the following settings: minimal repeat size of 30 bp, Hamming distance of 3, and 90% sequence identity. Chloroplast simple sequence repeats (SSR), also called microsatellites, were identified in Phobos v.3.3.12129. Only perfect SSRs with a motif size of one to six nucleotide units were considered. Additionally, we applied the standard thresholds for chloroplast SSR identification130: the minimum number of repeat units was set to 12, 6, 4, 3, 3, and 3 for mono-, di-, tri-, tetra-, penta- and hexanucleotides, respectively. A single IR region was used to eliminate the influence of doubled IR regions, and redundant results were deleted manually.
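The repeat-unit thresholds for perfect SSRs can be illustrated with a short Python sketch. This is a simplified scan written for illustration, not the Phobos algorithm; the function names are hypothetical, and motifs that are themselves repetitions of a shorter unit (e.g. "AA" or "ATAT") are skipped so that one repeat is not reported under several motif sizes.

```python
# Minimum number of repeat units per motif size (mono- to hexanucleotide),
# as stated in the text.
MIN_REPEATS = {1: 12, 2: 6, 3: 4, 4: 3, 5: 3, 6: 3}


def is_primitive(motif):
    """Return True if the motif is not a repetition of a shorter unit."""
    n = len(motif)
    return not any(
        n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n)
    )


def find_perfect_ssrs(seq):
    """Return sorted (start, motif, repeat_count) tuples, 0-based start."""
    seq = seq.upper()
    hits = []
    for size, min_rep in MIN_REPEATS.items():
        i = 0
        while i + size <= len(seq):
            motif = seq[i:i + size]
            if set(motif) <= set("ACGT") and is_primitive(motif):
                reps = 1
                # Extend while the next block is an exact copy of the motif.
                while seq[i + reps * size:i + (reps + 1) * size] == motif:
                    reps += 1
                if reps >= min_rep:
                    hits.append((i, motif, reps))
                    i += reps * size  # jump past the reported repeat
                    continue
            i += 1
    return sorted(hits)


# Toy sequence: A x12 and AT x6 pass the thresholds; A x11 does not.
toy = "GC" + "A" * 12 + "GC" + "AT" * 6 + "GG" + "A" * 11 + "C"
print(find_perfect_ssrs(toy))
```

The sketch ignores imperfect repeats and does not handle the IR duplication; as described above, a single IR copy would have to be removed from the input beforehand.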

Comparative analysis of chloroplast genomes

Chloroplast genome sequences of three Cerastium species (C. alpinum, C. arcticum, C. nigrescens) reported in this paper and plastome sequences of C. glomeratum (NC_066897) and C. arvense (MH627219) acquired from the NCBI database were used for the genome synteny analysis, which was performed with the use of MAUVE v.1.1.1131. Furthermore, the sequences were aligned in MAFFT v.7.310132 to perform sliding window analysis and evaluate nucleotide diversity (π) in chloroplast genomes using DnaSP v.6.10.04133. The step size was set to 50 base pairs, and the window length was set to 800 base pairs. Here, only complete chloroplast genome sequences were used; the C. arvense plastome, which has several gaps in its sequence, was excluded from this analysis. The results were visualized with the CIRCOS software package v.0.69–9134.
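The sliding window calculation of nucleotide diversity can be sketched in Python as follows. This is an illustrative reimplementation under simplifying assumptions, not the DnaSP code: π is taken as the mean number of pairwise differences per site, alignment columns containing gaps or ambiguous bases are skipped, and the function names are hypothetical.

```python
from itertools import combinations


def nucleotide_diversity(seqs):
    """Mean pairwise differences per site over gap-free alignment columns."""
    valid = [col for col in zip(*seqs) if all(b in "ACGT" for b in col)]
    if len(seqs) < 2 or not valid:
        return 0.0
    pairs = list(combinations(range(len(seqs)), 2))
    diffs = sum(col[i] != col[j] for col in valid for i, j in pairs)
    return diffs / (len(pairs) * len(valid))


def sliding_window_pi(seqs, window=800, step=50):
    """Return (window_midpoint, pi) tuples along an alignment (0-based)."""
    seqs = [s.upper() for s in seqs]
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    results = []
    for start in range(0, max(length - window, 0) + 1, step):
        chunk = [s[start:start + window] for s in seqs]
        midpoint = start + len(chunk[0]) // 2
        results.append((midpoint, nucleotide_diversity(chunk)))
    return results


# Toy alignment (made-up sequences) with a small window for readability.
toy_alignment = ["AAAA", "AAAT", "AATT"]
print(sliding_window_pi(toy_alignment, window=2, step=2))
```

With the default arguments the function uses the 800 bp window and 50 bp step described above; windows whose π exceeds a chosen cut-off can then be listed as candidate divergence hotspots.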

The selective pressure for genes identified in chloroplast genomes of C. alpinum, C. arcticum, C. nigrescens, C. glomeratum, and C. arvense was also analyzed. A total number of 77 protein-coding genes were selected for which synonymous (Ks) and non-synonymous (Ka) substitution rates, as well as Ka/Ks ratio, were estimated using DnaSP v.6.10.04. Cerastium glomeratum chloroplast genome was used as a reference. During the analyses, lack of psbL gene was noticed in C. glomeratum. Reannotation of the C. glomeratum plastome allowed us to identify the sequence for this lacking gene in its traditional position i.e., between psbJ and psbF (detailed location: 61391..61507). In the C. arvense cp genome all genes which were annotated in plastomes of C. alpinum, C. arcticum, C. nigrescens were also present, but unknown nucleotides (n) were recorded in six (rpl20, rpoB, rpoC1, rpoC2, ycf1, and ycf2), therefore these sequences were excluded from calculations for this species. The results were visualized with the CIRCOS software package v.0.69-9134.
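The way a Ka/Ks ratio is read as a selection regime can be expressed as a small Python helper. It is a sketch for interpreting values already estimated elsewhere (here, by DnaSP); the function names and the tolerance used for "equal to 1" are assumptions, and the numbers in the example are illustrative rather than results of this study.

```python
def ka_ks_ratio(ka, ks):
    """Return Ka/Ks, or None when Ks is zero and the ratio is undefined."""
    if ks == 0:
        return None
    return ka / ks


def selection_regime(ka, ks, tol=1e-9):
    """Classify a gene by its Ka/Ks ratio.

    > 1 -> positive selection, < 1 -> purifying selection, = 1 -> neutral.
    """
    ratio = ka_ks_ratio(ka, ks)
    if ratio is None:
        return "undefined"
    if abs(ratio - 1.0) <= tol:
        return "neutral"
    return "positive" if ratio > 1.0 else "purifying"


# Illustrative (made-up) substitution rates.
for ka, ks in [(0.002, 0.01), (0.03, 0.01), (0.01, 0.01), (0.01, 0.0)]:
    print(ka, ks, selection_regime(ka, ks))
```

Genes with Ks = 0 are reported as undefined rather than forced into a category, since identical synonymous sites are common among closely related plastomes.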

The junction sites between LSC, SSC, and IRs regions were also identified and compared. Additionally, data on the codon usage distribution was acquired from the Geneious v.R7 statistic panel.

Prediction of RNA editing sites

Potential RNA editing sites in the protein-coding genes from chloroplast genomes of C. alpinum, C. arcticum, C. nigrescens, C. glomeratum, and C. arvense were predicted using the PREPACT 3.0 tool135. Arabidopsis thaliana (NC_000932) was used as a reference for BLASTx prediction; both forward (C to U) and reverse (U to C) editing options were selected, while the remaining settings were kept at default (0.001 e-value cutoff and 30% filter threshold). In the case of C. arvense, the rpl20, rpoB, rpoC1, rpoC2, ycf1, and ycf2 genes were excluded from the analysis as unknown nucleotides (n) were recorded in their sequences. The results were visualized with the CIRCOS software package v.0.69–9134.

Phylogenetic analysis

Chloroplast genome sequences of three Cerastium species (C. alpinum, C. arcticum, and C. nigrescens) reported in this paper, as well as 56 plastomes of other representatives of the family Caryophyllaceae (including C. glomeratum and C. arvense) and A. thaliana (outgroup), were used for phylogenetic analysis (Table 3). Initially, the sequences of 71 protein-coding genes shared by all these species were extracted using a custom R script. Then, the concatenated sequences of 71 genes were aligned in MAFFT v7.310 and used for phylogeny reconstruction by Bayesian Inference (BI). The MEGA v.7 software136 was used to determine the best-fitting substitution model, and the GTR + G + I model was selected. The BI analysis was conducted using MrBayes v.3.2.6137,138, according to the parameter settings described in our previous paper63. The obtained phylogenetic tree was used as a starting tree for divergence time analysis performed using the RelTimeML feature in MEGA 7 with the GTR model. The divergence times between Cerastium arvense and Myosoton aquaticum (6.2–38.1 Mya), Arenaria serpyllifolia and Pseudostellaria japonica (20.3–83.4 Mya) and Dianthus chinensis and Silene latifolia (20.3–46.7 Mya) obtained in TimeTree139 were used as calibration constraints in calculations.
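The step of selecting genes shared by all plastomes and concatenating them can be sketched as follows. The authors used a custom R script that is not shown here; the Python version below is only an illustration of the idea, with a hypothetical input structure (species mapped to a dictionary of gene name and coding sequence) and hypothetical function names.

```python
def shared_genes(genomes):
    """Return the sorted list of gene names present in every genome."""
    gene_sets = [set(genes) for genes in genomes.values()]
    return sorted(set.intersection(*gene_sets)) if gene_sets else []


def concatenate_shared(genomes):
    """Concatenate the shared genes of each species in one fixed gene order."""
    genes = shared_genes(genomes)
    return {
        species: "".join(cds[gene] for gene in genes)
        for species, cds in genomes.items()
    }


# Toy input (made-up sequences): matK is absent from sp2, so it is dropped.
toy_genomes = {
    "sp1": {"atpA": "ATG", "rbcL": "CCC", "matK": "GGG"},
    "sp2": {"atpA": "ATA", "rbcL": "CCT"},
}
print(shared_genes(toy_genomes))
print(concatenate_shared(toy_genomes))
```

Using the same sorted gene order for every species keeps homologous genes in corresponding positions, which is what the subsequent alignment of the concatenated sequences relies on.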

Table 3 GenBank accession numbers and references for chloroplast genomes used in this study. Species list arranged alphabetically.

Ethics declarations

Authors confirm that the use of plants in the present study complies with international, national and/or institutional guidelines.