Comparative chloroplast genome analysis of seven extant Citrullus species insight into genetic variation, phylogenetic relationships, and selective pressure

Citrullus ecirrhosus, Citrullus rehmii, and Citrullus naudinianus are three important related wild species of watermelon in the genus Citrullus, and their morphological differences are clear, however, their chloroplast genome differences remain unknown. This study is the first to assemble, analyze, and publish the complete chloroplast genomes of C. ecirrhosus, C. rehmii, and C. naudinianus. A comparative analysis was then conducted among the complete chloroplast genomes of seven extant Citrullus species, and the results demonstrated that the average genome sizes of Citrullus is 157,005 bp, a total of 130–133 annotated genes were identified, including 8 rRNA, 37 tRNA and 85–88 protein-encoding genes. Their gene content, order, and genome structure were similar. However, noncoding regions were more divergent than coding regions, and rps16-trnQ was a hypervariable fragment. Thirty-four polymorphic SSRs, 1,271 SNPs and 234 INDELs were identified. Phylogenetic trees revealed a clear phylogenetic relationship of Citrullus species, and the developed molecular markers (SNPs and rps16-trnQ) could be used for taxonomy in Citrullus. Three genes (atpB, clpP1, and rpoC2) were identified to undergo selection and would promote the environmental adaptation of Citrullus.


Results
Chloroplast genome generation, annotation, and comparative feature analysis. The sequence data (0.8 Gb) of C. ecirrhosus, C. rehmii, and C. naudinianus were extracted randomly, and approximately 13.27% (C. ecirrhosus), 21.37% (C. rehmii) and 12.75% (C. naudinianus) of these clean paired reads mapped to the CPG of C. lanatus subsp. vulgaris. The CPGs of these three Citrullus species were obtained after de novo assembly and strict validation. Then, they were annotated and archived in GenBank with accession number ON597627, MZ577999, and MZ577998.
To reveal the structural features of seven CPGs in the genus Citrullus, comparative analysis was conducted ( Table 1), The average length of the CPGs is 157,005 bp, ranging from 156,906 bp (C. lanatus subsp. vulgaris) to 157,147 bp (C. colocynthis), and they demonstrated a typical quadripartite structure, including the LSC (86,728-87,007 bp) and SSC (17,999 bp) regions, and a pair of IRs (26,149 bp), but the variation coefficient of the SSC regions (2.57E−3) was greater than that of the LSC regions (9.56E−4) and IR regions Table 1. Summary statistics for the assembly of seven CPGs in the genus Citrullus. LSC Large single copy, SSC small single copy, IR inverted repeat. *Indicated the chloroplast genome of species was assembled in the study. www.nature.com/scientificreports/ (1.42E−3). The average GC content (%) in the seven CPGs was approximately 37. 16. Further analysis of GC content found that the IR region had the highest GC content compared to any other region in each CPG (Table 1). When the gene content of chloroplast genomes was compared, we found some mistakes in the original annotation of C. lanatus subsp. vulgaris and C. mucosospermus, then we updated the chloroplast genome annotation of C. lanatus subsp. vulgaris at NCBI. After correcting the chloroplast genome annotation of C. lanatus subsp. vulgaris and C. mucosospermus, we found that the seven CPGs encoded a set of 130-133 functional genes, including 85-88 protein-encoding genes, 37 transfer RNAs (tRNA), and 8 ribosomal RNAs (rRNA). Furthermore, 79 protein-encoding genes were identified among seven Citrullus species. Twenty-three genes were found to harbor introns, four of which (two rps12 gene, one clpP gene, and one ycf3 gene) had two introns. A consensus radius map of the seven CPGs was plotted to explore the variation in the IR/SC boundary ( Supplementary Fig. S1), suggested that there was no the significant variation in the IR/SC boundary. Moreover, fine collinearity was observed among the seven CPGs of Citrullus species (Supplementary Fig. S2). Based on the conclusion that the seven CPGs have similar total gene numbers, names, and orders, their annotated CPGs of the seven species were drawn in a circle map (Fig. 1).
Chloroplast genome sequence divergence and nucleotide diversity analysis. www.nature.com/scientificreports/ mVISTA and MEGA software with the CPG of C. lanatus subsp. vulgaris as a reference. The results showed that intergenic or noncoding regions were more divergent than gene encoding regions, and the largest sequence divergence was found between C. lanatus subsp. vulgaris and C. naudinianus, the smallest divergence was found between C. lanatus subsp. vulgaris and C. mucosospermus (Fig. 2). Their sequence divergence was also validated by the pairwise distance. The pairwise distances among the seven Citrullus species ranged from 0.0004 to 0.00641 (Supplementary Table S1). The largest sequence divergence of 0.00641 was found between C. naudinianus and C. lanatus subsp. vulgaris or C. mucosospermus, whereas the smallest sequence divergence of 0.0004 was found between C. lanatus subsp. vulgaris and C. mucosospermus.
To explore the nucleotide diversity (Pi value) at the different regions of the CPGs, the nucleotide diversity of the seven CPGs and the consensus protein coding regions in the Citrullus genus were calculated by DnaSP, and the data plot showed that the LSC and SSC regions were more divergent than the two IR regions (Fig. 3). The intergenic regions (rps16-trnQ (0.013), trnS-trnG (0.012), rps15-ycf1 (0.012), rpl32-trnL (0.011), and psbE-petL (0.010)) were the top five highest variables among the noncoding regions ( Table 2), but the highest Pi value of all the protein gene coding regions was 0.00754 (rpl32), which was less than that of most noncoding regions (Supplementary Table S2). After phylogenetic relationship analysis based on the five hypervariable fragments ( Supplementary Fig. S3), we found that all bootstrap values on the phylogenetic tree of rps16-trnQ were greater than 75%, and the topological structure was the same as that in previous studies, showing that the successful identification ratio of rps16-trnQ is 100%, while the other four hypervariable fragments had low success identification ratios, suggesting that only rps16-trnQ could be used as a high-efficiency molecular marker or DNA barcode for Citrullus species identification.
Repeat sequence analyses. Repeat sequences are commonly used as an important molecular source for understanding species phylogenetic relationships, yet they can also act a significant role in genome rearrangement 18 (27) was the highest, making the total number of repeat sequences in C. naudinianus (116) significantly greater than those in the other species, while the number of repeat sequences was not significantly different among the other Citrullus species.  Table S4). The average SSR number was 56 (Supplementary Table S5). Mono-, di-, tri-, tetra-, penta-, and hexanucleotide SSRs in each species were analyzed, but only mono-, di-, tri-, and tetra-nucleotide SSRs were found (Fig. 4B), and the average number and proportions of mono-, di-, tri-, and tetra-nucleotide SSRs were 40(71.6%), 12(21.1%), 2(3.6%), and 2(3.6%), respectively (Supplementary Table S5). Notably, no penta-or hexa-nucleotide SSRs were found in the seven CPGs of Citrullus, and the number of mononucleotide SSRs in C. naudinianus (45) was the highest among the seven species. Nevertheless, the SSRs in these chloroplast genomes were remarkably rich in A/T content, with lower G/C content. The A/T mononucleotide repeats constituted a large portion of the SSRs in all species. In addition, we found that the seven species shared 47 SSRs, and 34 of those SSRs were polymorphic (Table S6). Three selected SSR markers were polymorphic among C. lanatus subsp. vulgaris, C. mucosospermus and C. amarus (Fig. 5, Supplementary Fig. S4), suggesting that these SSR markers could be applied for species classification in Citrullus.
Chloroplast genome variation map. To identify the variation in the chloroplast genome between cultivated watermelon and the other six Citrullus species, 1271 single nucleotide polymorphisms (SNPs) were identified using VarScan, of which 769 (61.5%) SNPs were in noncoding regions, such as introns, and upstream and downstream regions of genes, and approximately 39.5% of SNPs were in coding regions, which led to synonymous variants and missense variants (Table 3). Notably, except for synonymous variants, 242 (19.0%) SNPs in coding regions resulted in variants of 43 protein coding genes (Supplementary Table S7). Of these genes, the accD, clpP, ndhF, rpoC2, ycf1, and ycf2 harbor more than 10 missense SNPs. Notably, only four SNPs were identified between cultivated watermelon and its nearest wild species (C. mucosospermus), and only one SNP (A/T at loci 78,734) was identified that altered the CDS of petB, which was validated by targeted pyrosequencing assays (Fig. 6), other three SNPs were in noncoding regions.
We also identified 234 small insertions or deletions (INDELs) among the chloroplast genomes of seven species. Most INDELs were in the intergenic regions and introns, and only 9 (3.8%) INDELs were located in six protein-coding regions of the matk, rpl33, infA, rps19, and ycf1 genes (Supplementary Table S8). Notably, matk, infA, and rps19 contain both variations, and a total of 45 protein-coding regions were altered by SNP or INDEL.     www.nature.com/scientificreports/ Phylogenetic relationships analysis. The phylogenetic relationships of seven Citrullus species were analyzed with the ML and BI methods based on the CPG, CDS, SNP and DNA barcode, and the resulting topologies of the seven cucurbit plants were largely congruent, with highly supported values ( Fig. 7 and Supplementary  Fig. S5). Figure 7 shows the phylogenetic tree of CPG generated by the ML analysis, including two types of supporting values: the ML bootstrap (ML-BS), and BI posterior probability (BI-PP). All ML-BS and BI-PP values on the clade were 100 and 1.0, respectively. The phylogenetic tree revealed that the seven Citrullus species were grouped into a clade, and the C. naudinianus was at the base of the clade, followed by C. colocynthis, C. rehmii, C. ecirrhosus, C. amarus, C. mucososperms, and C. lanatus subsp. vulgaris.
Selective pressure events. The ratio (ω) was calculated to evaluate the selective pressure on the 79-consensus protein-encoding genes of the seven CPGs of Citrullus. Three genes (atpB, clpP, and rpoC2) were detected to be subject positive to selective pressure by EasyCodeML ( Table 4). The ω in the M2a model (ω2) changed from 177.5821 to 429.3408, and of the three genes subject to positive selective pressure, clpP had the greatest ω2 value, which indicated that clpP could have undergone intense positive selection. Meanwhile, the amino acid sites in the three genes under positive selective pressure were determined, revealing that these genes harbor one intensively positive selective site (Table 4). Further analysis using the Ka/Ks calculator in Tbtools showed that the clpP1 gene was positively selected (ω > 1) between C. lanatus subsp. vulgaris and C. naudinianus (Supplementary  Table S9).

Discussion
In our study, the CPGs of three species of the genus Citrullus were de novo assembled and annotated followed by a comparative analysis with four other Citrullus species, all methods were strictly conducted according to the above relevant experimental methods. The chloroplast genome size of the seven Citrullus species was 157,005 ± 102 bp,  [20][21][22] . The noncoding regions showed higher sequence divergence than the protein-encoding regions, suggesting that most sequence variation in the noncoding regions could be further exploited and utilized. Repeated sequences could play a significant role in producing variation and rearranging genomes in CPG 23 . The TR, PR, DR, and SSRs were analyzed in the CPGs of seven Citrullus species in our research, a lot of repeat sequences were located in the intergenic regions of the Citrullus CPGs, the resemble results to our study have been found in the CPGs of Prunus and Quercus 10 . The TR and mono SSR numbers in the chloroplast genome of C. naudinianus were greater than those in the other Citrullus species, which could contribute to the genetic divergence between C. naudinianus and other Citrullus species. Polymorphic SSRs in the chloroplast genome have been applied to study on the population genetics and evolutionary of Citrullus 24 , in our study, Thirty-four SSRs were detected to be polymorphic among Citrullus species, suggesting our identified polymorphic SSRs could be developed to study on the genetics and evolutionary in Citrullus.
The SNPs and INDELs on the chloroplast genome have been used to study the diversity existed at the intraspecies and inter-subspecies level [25][26][27][28] . A total of 1271 SNPs were identified in the CPGs of seven Citrullus species, approximately 1 SNP/123 bp, which is less than that in the nuclear genome of 1 SNP/18 bp 1 , suggesting that the mutation rate of the chloroplast genome is lower than that of the nuclear genome. Forty-three protein coding genes have been changed by 19.0% SNPs in the coding region, especially, the accD, clpP, ndhF, rpoC2, ycf1, and ycf2 harbor more than 10 missense SNPs. These genes play a key role in plant photosynthesis and energy metabolism, but further research is needed to better understand the differences resulting from their mutations among Citrullus accessions. Although distinct differences in phenotype were observed between C. lanatus subsp. vulgaris and C. mucosospermus, they have only four SNPs at their CPGs, further supporting that C. mucosospermus is the most closely related wild species 1 .
The high nucleotide diversity of noncoding regions and coding regions could be developed as molecular markers or DNA barcodes for species identification, taxonomy and phylogenetic relationship analysis 29    www.nature.com/scientificreports/ intergenic regions (trnS-trnG, rps16-trnQ, rps15-ycf1, rpl32-trnL, and psbE-petL) with the highest nucleotide diversity (Pi) were identified in our study, while only rps16-trnQ could be used to develop specific DNA barcodes for the identification of germplasm resources in Citrullus. However, rps16-trnQ has not been used in the Citrullus taxonomy related study, but it has been widely used as a high-effect molecular marker and DNA barcode in other plant families 28,30 . The matk and rbcL genes are commonly used as traditional DNA barcodes for Cucurbitaceae and other land plants [31][32][33] , but their Pi values were 0.0021 and 0.0012, respectively, and most genes analyzed in this study have higher Pi values (Table S2), suggesting that matk and rbcL are not the best choice for using as DNA barcodes in Citrullus.
The CPG is an important molecular resource with better value for probing phylogenetic relationships than the whole nuclear genome in most plants 10 . Phylogenetic analysis based on the CPG has been used to reveal the phylogenetic relationships of species in Cucurbitaceae 22,23,34 . Phylogenetic trees were built in our study based on the CPG, CDS, SNPs and DNA barcode (rps16-trnQ) and showed that C. naudinianus was sister to the other six species, followed by Citrullus colocynthis, Citrullus rehmii, Citrullus ecirrhosus, Citrullus amarus, Citrullus mucososperms, and Citrullus lanatus subsp. vulgaris, which was similar to the previously reported Citrullus phylogeny constructed based on phenotypic traits, ancient literature, murals, traditional chloroplast DNA barcode, and molecular markers of the nuclear genome 1,2,35-39 , indicating that the phylogenetic tree based on our data has have high reliability for resolving the relationship of the seven species, and the molecular markers (SNPs and DNA barcodes) identified in our study have potential to be used to identify and classify species in Citrullus. In addition, the completed chloroplast genome of C. rehmii (PI632755) was published in the NCBI genome (NC_035975) in a previous study (Unpublished), when C. rehmii (PI632755) was added to the phylogenetic analysis, the results showed that PI632755 is closer to C. colocynthis than to C. rehmii ( Supplementary Fig. S6), suggesting that PI632755 could be classified as C. colocynthis, although it is still considered to be C. rehmii at U.S. National Plant Germplasm System and many previous studies 1, 40 .
Selective pressure analysis of genes could provide an important insight into the adaptive evolution of species. The KA/KS ratio is commonly applied to estimate the impacts of natural selection on creatures, and how the gene mutations promote the adaptive evolution of species 22,41 . In the present study, three genes (atpB, clpP1, and rpoC2) that were subjected to positive selection. Among them, atpB encodes the ATP synthase CF1 beta subunit and plays an essential role in photosynthesis. The increasing requirements for light in watermelon inflicted strong selective pressures on genes involved in photosynthesis during plant adaptive evolution, which supports our finding of positive selection for atpB. the similar conclusions were also reported for other species in Cucurbitaceae 22,34 . The product of the clpP gene is an ATP-dependent clp protease, which has been reported to participate in the transformation of chloroplast protein and shoot development 12,22 . The fact that clpP underwent positive selection in our study could be related to the adaptive evolution of the vining feature like other species in Cucurbitaceae 22 . Similarly, rpoC2 is validated to encode RNA polymerase β and could regulate the pollination process and sex differentiation in the genus Siraitia (Cucurbitaceae) 22 and sorghum 42 . Citrullus naudinianus is dioecious, and the other species in Citrullus are monoecious 1 , so the rpoC2 mutation could be essential for the adaptive evolution of sex differentiation in Citrullus. Therefore, these chloroplast functional genes could not only participate in energy metabolism, but also in plant organ development and sex differentiation.

Conclusion
The present research reported the CPGs of Citrullus rehmii, Citrullus naudinianus, and Citrullus ecirrhosus, which were valuable resources for understanding the genus Citrullus. The CPG average length of Citrullus is 157,005, and all CPGs are composed of 130-133 genes. The CPG of the seven Citrullus species studied here showed a good level of similarity in terms of gene content, gene order and genome structure. However, the contraction and expansion of the SSC regions contribute to changes in the chloroplast genome size of Citrullus. The greatest sequence divergence was observed between C. naudinianus and C. lanatus subsp. vulgaris or C. mucosospermus, whereas the smallest sequence divergence was between C. lanatus subsp. vulgaris and C. mucosospermus. Thirty-four polymorphic SSRs, 1,271 SNPs and 234 INDELs were identified, and few variations in the coding regions led to the variation of 45 protein-coding genes. The rps16-trnQ can be developed as a high-resolution DNA barcode beneficial for taxonomy in Citrullus. Phylogenetic relationship analysis indicated that the clear phylogenetic relationship of Citrullus and molecular markers developed in our study could be applied to the species identification and classification of Citrullus. Three protein-coding genes (atpB, clpP1, and rpoC2) were subjected to selection pressure, which would promote the adaptation of Citrullus to the growing environment. Our research results provide a basis for understanding the genetic differences in the chloroplast genome of Citrullus and the domestication of cultivated watermelon.

Materials and methods
Plant material. Healthy fresh leaves were collected from young plants of C. lanatus subsp. vulgaris (W1-1), C. mucosospermus (PI186490), and C. amarus (PI269341) at the horticultural station of teaching and experimentation of Jiangxi Agricultural University (JXAU) in the spring of 2022. The total DNA of these plant materials were isolated from about 3 g of young leaves using a modified cetyltrimethylammonium bromide (CTAB) method. We used agarose gel electrophoresis (1%, 180 V for 45 min) and an Agilent BioAnalyzer 2100 (Agilent Technologies Inc., California, America) to evaluate the DNA quality and quantity. Approximately 300 μg of total DNA from each sample was stored at -80 ℃ until use. Our experimental research and field studies on plants, including the collection of plant material, comply with the guidelines of JXAU and relevant institutional, national, and international guidelines and legislation. www.nature.com/scientificreports/ Chloroplast genome assembly. To perform de novo assembly and compare the seven extant species in the genus Citrullus, for C. rehmii (SRR8751709), C. ecirrhosus (SRR8751725) and C. naudinianus (SRR8751817), we used Fastq-dump to download their clean sequence data with 150 bp paired-end reads of total DNA 1 . Next, the CPGs of the three Citrullus species above were assembled using the following method. First, to preclude the influence on chloroplast genome assembly from the consensus reads of nuclear and mitochondria with chloroplasts, 0.8-Gb sequence data were randomly drawn using the Sample function of Seqtk (v0.1) (https:// github. com/ lh3/ seqtk) and then assembled by plasmidspades.py in SPAdes (v3.10.1) 43 . Next, scaffolds with high coverage and length representing the chloroplast genome were extracted and ordered into a draft genome sequence by BLAST against the reference CPG of C. lanatus subsp. vulgaris (NC_014043.1). Gaps in the chloroplast draft genome sequence of each accession were repaired by GapCloser (v1.12-r6) 44 and overlap sequences at two flanks of gaps were deleted manually. Finally, the integrity and quality of the chloroplast genome sequence was then inspected and manually corrected by reference-guided mapping using BWA 45 and SAMtools 46 untill the CPG was obtained.
Genome annotation. The three newly assembled CPGs were annotated using CpGAVAS2 47 and GeSeq 48 , followed by manual adjustments of gene boundaries with errors through inspection using IGV v2.4 49 . The final GenBank and Sequin format of the annotated information was produced in GB2sequin 50 . All records with were then deposited in a Sequin format, and they can be viewed in the NCBI database. GenBank files were created using Sequin v15.50 (http:// www. ncbi. nlm. nih. gov/ Sequin/ index. html) and then drew the chloroplast genome maps in OGDRAW v1.3.1 51 .
Comparative analysis of genome feature. To explore the intergenic variation among the CPGs of seven Citrullus species by genome comparison, the chloroplast reference genomes of C. lanatus subsp. vulgaris, C. amarus, and C. colocynthis that we had published in our past researches were also included [14][15][16] , and the complete chloroplast genome of C. mucosospermus were directly obtained from the Organelle Genome Database (OGD; NCBI website). The data for the seven Citrullus species were listed in Table 5. The gene rearrangements at the boundary between the single copy (SC) region and inverted repeat (IR) region were inspected and analyzed by IRscope 52 to understand the variation in binding regions within the CPGs among the seven Citrullus species. The chloroplast genome collinear analysis among the seven Citrullus species was conducted using Mauve Alignment in Geneious Prime 53 .
Sequence divergence and nucleotide diversity analysis. To inspect the CPG sequence divergence of the seven Citrullus species, the mVISTA web tool was used to align and visualize the results with default parameters 54 . The CPG of C. lanatus subsp. vulgaris was used as reference, and then the pairwise distances based on the CPG of seven Citrullus species was estimated by MAFFT v7 55 and MEGA X 56 . Nucleotide diversity in the CPGs of the seven Citrullus species were scanned using DNA polymorphism in DNAsp v6.12.03 with an 800bp window and 200-bp step size 57 . Parsimony-informative base sites (PIC), variable sites and K-2P distance of selected DNA barcodes were analyzed in MEGA v11 software 58 . Repeated sequence identification. Dispersed repeats (DR), palindromic repeats (PR), long tandem repeats (TR) (greater than 7 bp), and simple sequence repeats (SSR) were explored in the CPGs of seven Citrullus species. Vmatch software was applied to identify the DR and PR with a minimal repeat size of 30 bp, a hamming distance is 3, and a similarity percentage between two repeat copies greater than 90%, the overlapping repeats in the results were merged into one repeat motif by manual correction. TR at least 7 bp in length was detected by Tandem Repeats Finder program with the alignment parameters for match, mismatch, and INDELs set at 2, 7, and 7, respectively 59 . SSRs were determined by MISA with thresholds of 10, 5, 4, 4, 4, and 4 for mono-, di-, tri-, tetra-, penta-, and hexa-nucleotides, respectively 60 .

Variation calling.
To identify the single nucleotide polymorphisms (SNPs) and insertions or deletions (INDELs) of the CPGs between cultivated watermelon (C. lanatus subsp. vulgaris) and six other Citrullus species, we downloaded their total DNA sequence data from the NCBI SRA database, as shown in Table 5. Then, SNP validation and SSR marker development. To validate the missense SNP of petB between cultivated watermelon and C. mucosospermus, the petB gene was cloned from W1-1 (C. lanatus subsp. vulgaris) and PI186490 (C. mucosospermus). First, the primer was designed according to the CDS of petB in W1-1 (Table 6). Then, total mRNA was extracted from leaves and reverse transcribed for first-strand cDNA synthesis using the M5 Sprint qPCR RT kit according to the manufacturer's instructions (Mei5 Biotechnology, Co, Ltd). The DNA fragments were amplified and purified according to the operation instructions of the Hieff Canace Plus PCR Master Mix and TIAgel purification kit, and sequenced using the targeted pyrosequencing assay in Sangon Biotech (Shanghai) Co., Ltd. Three selected SSRs with flanking sequences were used to design primers in Primer Premier 5.0 68 software with default parameters ( Table 6). The polymorphism of molecular markers was verified among W1-1 (C. lanatus subsp. vulgaris), PI186490 (C. mucosospermus), and PI269341 (C. amarus) using PCR amplification, and PCR assay was conducted according to our previous study 69 . The PCR products were separated with by 6% denaturing polyacrylamide gel electrophoresis (PAGE) and detected by silver staining.