Complete chloroplast genome sequences of four Allium species: comparative and phylogenetic analyses

The genus Allium is one of the largest monocotyledonous genera, containing over 850 species, and most of these species are found in temperate climates of the Northern Hemisphere. Furthermore, as a large number of new Allium species continue to be identified, phylogenetic classification based on morphological characteristics and a few genetic markers will gradually exhibit extremely low discriminatory power. In this study, we present the use of complete chloroplast genome sequences in genome-scale phylogenetic studies of Allium. We sequenced and assembled four Allium chloroplast genomes and retrieved five published chloroplast genomes from GenBank. All nine chloroplast genomes were used for genomic comparison and phylogenetic inference. The chloroplast genomes, ranging from 152,387 bp to 154,482 bp in length, exhibited conservation of genomic structure, and gene organization and order. Subsequently, we observed the expansion of IRs from the basal monocot Acorus americanus to Allium, identified 814 simple sequence repeats, 131 tandem repeats, 154 dispersed repeats and 109 palindromic repeats, and found six highly variable regions. The phylogenetic relationships of the Allium species inferred from the chloroplast genomes obtained high support, indicating that chloroplast genome data will be useful for further resolution of the phylogeny of the genus Allium.


Results
Genome sequencing and assembly for four Allium species. Four Allium species were sequenced, and 8,255,274-13,393,542 paired-end clean reads were obtained. Three complete cp genomes (A. fistulosum, A. sativum and A. cepa) were directly assembled by NOVOPlasty 2.6.2. A. tuberosum Rottl. ex Spreng. was not circularized by NOVOPlasty. We assembled this genome using SPAdes 3.11.1 and visualized it with Bandage 0.8.1. According to the "Depth range" (≥500), five merged nodes were selected and used to align with the reference NC_024813.1 in Mummer 3.23. The nodes for which the order had been determined were linked to two super-contigs based on their overlap ( Supplementary Fig. S1). Two pairs of primers (p1, p2 and p3, p4) were designed according to the two gaps and their sites information provided in Supplementary Fig. S1. Then, PCR amplification and Sanger sequencing were conducted to fill these gaps. Last, the complete cp genomic sequences were assembled by SPAdes 3.11.1 with the options of "-trusted-contigs" including five node and two gap sequences. Alternatively, the sequences from the five nodes and two gaps could be linked manually according to the alignment graph ( Supplementary Fig. S1). As a result, four complete cp genomes had been assembled by using the data from two sequencing platforms (HiSeq 4000 and 2500) and two read lengths (150 and 100 bp) (Supplementary Table S1). Finally, the four complete genomes were evaluated by Qualimap v.2.2.1 using the corresponding paired-end reads. The most obvious difference was that the cp DNA extraction methods, including HSLp and SucDNase, produced higher rates of mapping and mean coverage than the total DNA extraction method (Supplementary Table S2). Additionally, HSLp (high-salt low-pH) method had the highest rate of mapping (74.05% and 92.36%), and it was more effective than SucDNase in isolating cp DNA from other DNA (nuclear DNA and mitochondrial DNA) (Supplementary Table S2). Although its mapping rate was the lowest (3.99), the extraction method of total DNA (A. cepa) also obtained sufficient sequencing depth (334.02X) and a better assembly result because of a large number of reads and a very small cp genome (Supplementary Table S2). The four new complete cp genome sequences were deposited in GenBank (accession numbers: MK335927, MK335929, MK335928 and MK335926).
Organization and gene content of nine Allium species. The nine complete cp genome sequences, which consisted of the four Allium species sequenced in this study and five accessions from GenBank, were combined for comprehensive analysis. The genomes ranged in size from 152,387 bp (A. obliquum) to 154,482 (A. prattii) (Fig. 1). All of these genomes presented typically quadripartite structures, with two IRs (26,564) separated by the LSC (81,392) and SSC (17,066) regions (Table 1). Allium cp genomes showed similar gene content and order, containing 140-141 genes consisting of 88-89 protein-coding genes, 37-38 tRNA genes, 5-10 pseudogene and 8 rRNA genes located in the IR regions ( Fig. 1; Table 1). Main components and their proportions were high conserved in eight Allium cp genomes except for A. prattii (Supplementary Tables S3  Figure 1. Gene map of the nine Allium chloroplast genomes. The genes inside and outside the circle are transcribed in the clockwise and counter-clockwise directions, respectively. Genes belonging to different functional groups are colour coded. The thick lines indicate the extent of the inverted repeats (IRa and IRb) that separate the genomes into large single-copy (LSC) and small single-copy (SSC) regions. Grey bars on the inside of circle indicate GC content, with the line representing 50%. www.nature.com/scientificreports www.nature.com/scientificreports/ and S4). However, the numbers and components of pseudogene differed substantially (Table 1; Supplementary  Table S5), including those of atpB, ψinfA, rps16, rps2, rbcL, trnL-UAA and ycf2. The genes of atpB, rbcL, trnL-UAA and ycf2 were pseudogenes only in A. prattii. The pseudogene infA was absent in A. tuberosum Rottl. ex Spreng.. The rps16 gene was a pseudogene in A. obliquum and A. sativum but a protein-coding gene in the other seven cp genomes. The rps2 gene was a pseudogene in seven accessions (A. fistulosum, A. tuberosum Rottl. ex Spreng., A. sativum, A. cepa N, A. cepa CMS-T, A. cepa CMS-S, and A. obliquum) but was a protein-coding gene in A. prattii and A. victorialis. One tRNA (trnL-UAA) was converted to a pseudogene because of the lack of a 5′ end in only A. prattii.

Species
The overall GC content of different regions or components, including complete cp genome, LSC, IR, SSC, coding sequences (CDSs), tRNA, rRNA and pseudogene, was determined based on their annotation. Except for the pseudogene category, the GC content of nine complete cp genomes was very similar in each category (Supplementary Table S6). However, the GC content in eight regions or components of each genome exhibited distinct differences (Supplementary Table S6). The highest was observed in rRNA and the lowest in SSC. The order of the GC content was as follows: rRNA (>55%), tRNA (>52%), IR (>42%), CDSs (>37%), complete genome (>36%), LSC (>34%), SSC (>29%) (Supplementary Table S6). In the pseudogene category, there were relatively large differences among nine Allium taxa due to the components and numbers of pseudogenes (ψinfA,  Table S6).
iR/Sc boundary. The IR/SC boundary regions of the 11 complete cp genomes were compared, and the IR/SC junctions showed substantial differences (Fig. 2). From basal monocots of Acorus americanus to Agapanthus or Allium, the expansion of the IRs to rps19 or rpl22, which was described using PCR sequences by Wang 43 , was also observed at the IR/ LSC junctions. In Acorus americanus, rps19 flanked the junction between LSC and IRb (JLb), while a partial sequence of rps19 was present in the IR regions and another located in the LSC region. However, two IRs all contained a complete trnH-rps19 cluster with a length of 81-84 bp away from the IR/LSC boundary in Allium and 52 bp in Agapanthus coddii. Then, JLb expanded into the 5′ portion of the rpl22 gene with a length of 33-36 bp in Allium. The junctions of IR/SSC were located in the gene ycf1(a or b) or between ycf1b and ndhF. The 3′ end of the gene ycf1b and ndhF exhibited substantial differences for expansion or contraction of IRs. Overlaps of orf (ycf1a and ndhF) were observed in A. fistulosum, A. tuberosum Rottl. ex Spreng., A. cepa (N, CMS-T and CMS-S), A. obliquum, A. victorialis and Agapanthus coddii. The length from ndhF to the junction between SSC and IRb (JSb) exhibited a distinct difference (from 180 to −31), so the boundary characteristics of IR/SSC were more complex than those of IR/LSC. Overall, the IR/SC boundary regions in the nine Allium species showed similar characteristics, with only slight differences in the length flanking or away from the boundary in the organization genes, namely, rpl22, rps19, ycf1b, ndhF, ycf1a and psbA.
Repeat sequence analysis. The numbers and distributions of three repeat types (tandem, dispersed and palindromic repeats) in the nine Allium cp genomes were similar and conserved ( Fig. 3A; Supplementary  Table S8). There were 394 repeats, including 131 tandem repeats, 154 dispersed repeats and 109 palindromic repeats (Supplementary Table S8). These repeats were distributed in 657 sites containing 131 tandem repeat sites and 526 dispersed and palindromic repeat sites (one site was counted in one tandem repeat and two sites in one dispersed or palindromic repeat) (Supplementary Datasets 1 and 2). The lengths of the repeat units ranged from 11 to 91 bp. Based on the quadripartite structure of the cp genome, LSC regions had the most repeat sites (411, 62.56%), followed by IR (198, 30.14%), SSC (42, 6.39%) and the overhanging junction (6, 0.91%, 1 SSC/ IRa and 5 IRb/SSC) ( Supplementary Fig. S2A). According to the classification of gene structure, CDS, IGS (intergenic spacer) and intron, a majority of the repeat sites were in IGS regions (451, 68.65%), in which the ycf2~trnI contained the most numbers of repeat sites (2 × 27, 2 × 4.11%), and a minority were in introns (36, 5.48%) ( Supplementary Fig. S2B). Only a few types of gene (e.g., psaA, psaB, rpoC2, trnF, ycf1a, ycf1b, ycf1b-ndhF, ycf2) possessed repeat elements, and the gene ycf2 contained the highest number of repeat sites (120, 18.26%). A. obliquum, with 55 repeats, had the maximum number of repeats, and A. victorialis and A. cepa CMS-S, with 37 repeats, had the lowest number of repeats (Supplementary Tables S8). We detected 814 simple sequence repeats (SSRs) in the nine cp genomes using the Perl script MISA. The numbers of SSRs differ among the nine Allium genomes and vary from 73 in A. tuberosum Rottl. ex Spreng. to 96 in A. cepa N and T, as shown in Supplementary Table S9. The most abundant SSR motifs were mononucleotide repeats, which accounted for approximately 66.71% of the total SSRs, followed by dinucleotide (16.71%) and tetranucleotide (11.67%) repeats. We also found that there were more tetranucleotide repeats than trinucleotide and pentanucleotide repeats. Hexanucleotide repeats were very rare across these cp genomes, appearing only once in A. tuberosum Rottl. ex Spreng. and A. cepa CMS-S. Almost all mononucleotide repeats were composed of A/T (98.16%), with only 1.84% composed of C/G. AT/AT repeats constitute was 80.88% of dinucleotide repeats, while AG/CT repeats constitute only 19.12% (Fig. 3B).  Table S10). There were only slight differences in CNGs (conserved non-gene sequences) (93.88-96.62%) (Supplementary Table S10).

Sequence divergence analysis. With
The nucleotide diversity in the complete cp genome, LSC, IR, and SSC was compared among the nine Allium cp genomes. Based on analysis by DnaSP version 6.1 software, polymorphic sites, parsimony-informative sites, and nucleotide diversity were determined. In the complete cp genomes, 5,552 polymorphic sites (3.49%) and Scientific RepoRtS | (2019) 9:12250 | https://doi.org/10.1038/s41598-019-48708-x www.nature.com/scientificreports www.nature.com/scientificreports/ 2,502 parsimony-informative sites (1.57%) were observed, and the nucleotide diversity was 0.01244 (Table 2). SSC regions exhibited higher divergence (0.02564) than LSC (0.01585) and IR (0.00297) regions (Table 2). To further calculate the sequence divergence level in the local regions of cp genomes, the nucleotide diversity (pi) value within a 600-bp window was calculated with 200-bp steps. These values varied from 0 to 0.05787. Then, six highly divergence regions (or hotspot regions) (Table 3), namely, trnK-rps16 (exon2-intron), trnT-trnL, trnL-trnF-ndhJ, ndhF-rpl32 -trnL, rpl32-trnL-ccsA, and ycf1a, were identified with a cut-off of 0.04. These hotspots were all located The p-distance and number of nucleotide substitutions were used to estimate divergence among the nine Allium cp genomes. The p-distance ranged from 0.00006 to 0.02306 with an overall average of 0.01244, and the number of nucleotide substitutions was found to range from 9 to 3,411 (Supplementary Table S11). A. prattii and A. sativum exhibited the greatest sequence divergence (0.02306). A. cepa N exhibited only nine nucleotide substitutions compared to A. cepa CMS-T but 316 nucleotide substitutions compared to A. cepa CMS-S. These results further indicated that the onion species with N and CMS-T cytoplasm were more closely related to each other than to that with a CMS-S cytoplasm. phylogenetic analysis. In this study, six datasets (complete chloroplast genome, IR, LSC, SSC, SC and the combined variable regions) from 11 cp genomes sequences were created on the basis of their annotation, and the number of sites used to construct phylogenetic trees ranged from 3,724 to 137,185 (Supplementary Table S12). According to the identification results obtained by jModeltest v2.1.10, the best-fit models for each dataset based on the Akaike information criterion (AIC) are listed in Supplementary Table S12. The maximum likelihood (ML) and Bayesian inference (BI) models were selected based on the above results and the RAxML v8.2.12 manual. The topologies of the phylogenetic trees based on the two methods of analysis (ML and BI) were identical for each dataset. And the datasets generated similar topological structures with a very high support, except for the IR dataset (Fig. 6). Allium species and Agapanthus coddii produced two distinct branches with very high support (100% and 1.00). In the genus Allium, nine accessions were divided into two sister clades. The first clade contained two species, namely, A. prattii and A. victorialis. The 2nd clade included seven accessions, in which A. cepa (CMS-T and N) was grouped in a sister branch and then clustered step by step with A. cepa CMS-S, A. fistulosum, A. obliquum, A. sativum and A. tuberosum Rottl. ex Spreng.
From the dataset of the divergence hotspots (including only 229 parsimony-informative sites) (Table 3), we also inferred the phylogenetic relationships that were identical to other datasets except for IR dataset, in terms of topological structure (Fig. 6). Moreover, even in the infraspecific classification, three cytoplasmic types of A. cepa were also clearly identified. The cp genome can be used to resolve the deeper branches within species. However, the present study analysed only a limited number of species. With the rapid improvement of sequencing technologies, the sequencing of complete cp genome will become routine. Therefore, more and more complete sequences of cp genome will be used to further elucidate the phylogenetic relationships of the genus Allium.

Discussion
In this study, four complete cp genome sequences were sequenced and assembled. The genome sizes ranged from 153,162 bp (A. fistulosum) to 154,056 bp (A. tuberosum Rottl. ex Spreng.), similar to the genome size for A. cepa reported by Kohn 42 and Kim 44 . Subsequently, nine complete Allium cp genome sequences, including those of the four Allium species sequenced in this study and five obtained from GenBank, were compared. The cp genomes of Allium are highly conserved, with identical gene content and order, and genomic structures comprising four     Table S6), which has also been observed in other angiosperm cp genomes 42,44,45 . Additionally, the rps12 gene is a trans-spliced gene with the 5′ end located in the LSC region and the duplicated 3′ end in the IR region, as has been  www.nature.com/scientificreports www.nature.com/scientificreports/ identified previously in other reports 42 . However, the numbers and components of pseudogene were substantially different, especially the loss of sequence infA in A. tuberosum Rottl. ex Spreng.. In addition, surprisingly, the genes atpB, rbcL, trnL-UAA and ycf2 were present as pseudogene in only A. prattii but as protein-coding genes in the other eight cp genomes (Supplementary Table S5). This transformation might be caused by sequence contamination originating from the mitochondrial genome 46,47 or by annotation error, as previously discussed for Fagaceae 48 . Due to the components and numbers of pseudogene, the GC levels and lengths of the pseudogene also varied by species from 35.9% to 41.04% and 1,043 to 17,782, respectively (Supplementary Table S7).
The change in position of the IR/SC junction may have been caused by contraction or expansion of the IR region, which is a common evolutionary phenomenon 34,43,[49][50][51] and may cause variations in the lengths of angiosperm plastid genomes 49 . In the nine Allium species, the IR/SC boundary regions showed similar characteristics, with only slight differences observed in the length flanking or away from organization genes, namely, rpl22, rps19, ycf1b, ndhF, ycf1a and psbA (Fig. 2). Expansion of IR regions was also found from basal monocots of Acorus americanus to Agapanthus or Allium 43 . The complete trnH-rps19 cluster is present in IR regions, in which this type of IR/LSC junction is consistent with TYPE III, as reported by Wang 43 . We also found that A. cepa N and CMS-T exhibited the same features in four junctions, but A. cepa CMS-S exhibited slight differences compared with A. cepa N and CMS-T. For example, the length of extension of rpl22 into IRb was 34 bp in N and CMS-T and 33 bp in CMS-S. These differences in length were also exhibited by other organization genes, such as rps19, ycf1b, ycf1a, ndhF and psbA. These results maybe hint the difference in origin and evolution by previous reports 44,[52][53][54][55][56] .
Large, complex repeat sequences may play important roles in the rearrangement of plastid genomes and sequence divergence 57,58 . We found that the repeat sites in the nine Allium species were similar and conserved, usually located in IGS regions (451, 68.65%) (Supplementary Fig. S2). Surprisingly, the ycf2 gene contained the most repeat sites (120, 18.26%), and the ycf1 gene contained only seven sites (1.07%) (Figs S2 and S3). Moreover, we identified 81 long repeats of more than 40 bp, accounting for approximately 20.56% of the total 394 repeats. This rate is similar to previous reports on other plant lineages 34,59,60 . SSRs are thought to be the results of slipped strand mispairing during DNA replication, which are frequently observed in cp genomes and have been shown to have substantial application potential in population genetics and breeding programmes [61][62][63] . In this study, 814 SSRs were identified, with the most abundant mononucleotide repeats, accounting for 66.71% of the total SSRs, followed by dinucleotide, tetranucleotide, trinucleotide, pentanucleotide, and hexanucleotide repeats. Almost all mononucleotide repeats were composed of A/T (98.16%), with only 1.84% composed of C/G. Among dinucleotide repeats, AT/AT accounted for 80.88%, while AG/CT for only 19.12%. Our results are comparable to previously reported findings that SSRs in cp genomes are composed of polyadenine (poly A) or polythymine (poly T) repeats and rarely contained tandem guanine (G) or cytosine (C) repeats 45 . We also found that tetranucleotide repeats were more abundant than trinucleotide and pentanucleotide repeats, which is consistent with a report on Quercus species 64 . Hexanucleotide repeats were very rare across the nine Allium cp genomes, similar to the results in Lilium 45 . These new SSR resources will potentially be useful for population studies on the Allium genus, especially in combination with other informative nuclear genome SSRs.
The alignments of the nine Allium complete cp genomes revealed a high degree of synteny (Fig. 4). SC regions were more divergent than two IR regions, and the non-coding regions exhibited greater divergence than the coding regions. Similar results are reported previously for many cp genomes 31,39,45,64 . Differences in accD in S/N-cytoplasmic onions, which were detected in a previous report 42 , were also observed (Fig. 4). The nucleotide substitution rate is a central feature of molecular evolution 65 . All pair-wise sequence comparisons in our study showed that the number of nucleotide substitutions differed greatly, ranging from 9 to 3,411 (Supplementary  Table S11). This result suggests that DNA sequences evolve at different rates in different species which had also been observed in other taxa 66 .
The phylogenetic trees based on different datasets produced similar topological structures, except for the IR dataset, possibly because this region was highly conserved and provided fewer informative sites than the SC regions (Table 2). First, Allium species and Agapanthus coddii produced two distinct branches with very high support (100% and 1.00). Then, in the genus Allium, nine taxa were divided into two clades. The first clade included two species (A. prattii and A. victorialis) belonging to the second evolutionary line described in previous reports 2,11,22 . The other species, including A. tuberosum Rottl. ex Spreng., A. sativum, A. obliquum, and A. cepa (CMS-T, CMS-S and N), grouped into clade two, belonging to the third evolutionary line 2,11,22 . In A. cepa, CMS-T and N grouped into a sister branch and then clustered with CMS-S. These results indicate that CMS-T and N share a close relationship and that CMS-S does not originate from A. cepa N and T, which is consistent with previous reports that the S cytoplasm had two origins 67 . The S cytoplasm may be an alien cytoplasm transferred from an unknown Allium species to the bulb onion through the viviparous interspecific triploid 'Pran′. Alternatively, the S cytoplasm could be a component of one or more wild populations from which onion was domesticated, and S-cytoplasmic plants could be the seed parents of 'Pran' 67 .
The results presented here not only robustly support previous reports of three major clades inferred by ITS, rps16 and matK data 1,2,11,12,14-21 but also show increased bootstrap or posterior probability values, especially at low taxonomic levels (such as infraspecific classification). These results also suggest that cp genome data can effectively resolve the phylogenetic relationships of the genus Allium. Although the matK, rps16, and ITS genes have been widely used to investigate taxonomy in Allium, these markers will exhibit increasingly extremely low discriminatory power because this genus contains more than a lot of species and especially a large number of new www.nature.com/scientificreports www.nature.com/scientificreports/ Allium species are being identified step by step 11,21 . Fortunately, for the purposes of this study, the cp genomes of Allium species have highly divergent regions with far greater nucleotide diversity than matK and rps16. Moreover, using the dataset of highly divergent regions, we also inferred the deeper phylogenetic relationships that were identical with other datasets, except for the IR dataset. This finding also suggests that these regions are useful resources for the phylogenetic analysis of Allium species, not only in interspecies but also infraspecific classification. Although the present study analyzed a limited number of species, our data will serve as a reference for future genome-scale phylogenetic studies of Allium. With the rapid improvement in sequencing technologies, sequencing of complete cp genome will become routine. Therefore, an increasing number of cp genome sequences will be used to further elucidate the phylogenetic relationships of the genus Allium.

Methods
Plant material and DNA extraction. Fresh leaves of four Allium species were harvested from the Vegetable and Flower Research Institute of Shandong Academy of Agricultural Sciences. DNA samples were isolated by three methods (Supplementary Table S1): (1) Total genomic DNA for A. cepa N (N218) was isolated using the Plant Genome Extraction Kit (PGEK) (Tiangen Biotech, Beijing, China); (2) cpDNA for A. tuberosum Rottl. ex Spreng. was isolated by the sucrose-DNase (SucDNase) method 68 ; (3) cpDNA for A. fistulosum and A. sativum was isolated by the high-salt low-pH (HSLp) method 69 . DNA concentration and quality were measured using a NanoPhotometer P330 (Implen GmbH, Munich, Germany) and agarose gel electrophoresis.
Genome sequencing, assembly, and annotation. DNA samples from A. fistulosum, A. tuberosum Rottl. ex Spreng. and A. sativum were sheared to construct a ~350-bp paired-end library in accordance with the Illumina HiSeq 4000 protocol to obtain an average read length of 150 bp (Supplementary Table S1). Another ~350-bp paired-end library for the A. cepa sample was constructed using the Illumina HiSeq 2500 protocol with an average read length of 100 bp (Supplementary Table S1). Quality control of the raw sequence reads was performed using an ultra-fast FASTQ preprocessor, fastp version 0.15.0 70 , using default parameters, except -q 20 and -n 10. Each species yielded at least 1.2 Gb of clean data (Supplementary Table S2).
First, high-quality reads were assembled by NOVOPlasty 2.6.2 71 with the default parameters set using the seed sequence AcrbcL from the reference sequence NC_024813.1. The orientation was resolved manually based on NC_024813.1. A. tuberosum Rottl. ex Spreng. was not circularized by NOVOPlasty, and therefore, we assembled this sequence using SPAdes 3.11.1 72 . The file of "fastg" was visualized by the software Bandage 0.8.1 73 , and the alignment of nodes or contigs were conducted by Mummer 3.23 74 . Gaps in the cp genome sequences were filled by PCR amplification and Sanger sequencing based on reference NC_024813.1. PCR was performed in a total volume of 25 μl using the TaKaRa PCR Amplification Kit (TaKaRa Biotechnology, Dalian, China). The PCR mixtures contained 50 ng template DNA, 0.2 μM of each primer, and 12.5 μl PCR Mix. The primer pairs p1 (5′ GAGACTACCAGATCCCCGCTAT 3′) and p2 (5′ CTTTGGAATACTGGAAGGGTCG 3′) were used to amplify gap1, and p3 (5′ ATGTCGAATACTAACTTATCTGTCTGC 3′) and p4 (5′ ATTTCACCATAGCGGCTTACTT 3′) to gap2. The PCR protocol was as follows: initial denaturation at 94 °C for 4 min, followed by 35 cycles of 94 °C for 30 s, 50 °C for 30 s, 72 °C for 1 min, with a final 5 min extension at 72 °C. The amplified products were separated on 1.0% agarose gels and visualized by ethidium bromide staining. Evaluation of the assembly was performed by Qualimap v.2.2.1 75 .
The complete cp genomes were annotated by plann v.1.1.2 76 using NC_024813.1 as a reference and then checked by DOGMA 77 (http://dogma.ccbb.utexas.edu/). The positions of start and stop codons, and the boundaries between introns and exons were manually corrected by comparison with the published cp genome of NC_024813.1. The annotated GenBank files were used to draw circular cp genome maps using OrganellarGenome DRAW (https://chlorobox.mpimp-golm.mpg.de/OGDraw.html). The organization and gene content of the nine Allium taxa were analysed according to the corresponding annotations. Then, the boundary regions of the LSC, SSC, and IRs were also compared from 11 accessions, including the nine Allium cp genomes and those of the closely related species Agapanthus coddii, and the basal monocot Acorus americanus.
Repeat element analysis. Tandem repeats were detected using Tandem Repeats Finder (TRF) version 4.09 78 with advanced parameters. The alignment parameters match, mismatch, and indel were set to 2, 7, and 7, respectively, and the minimum alignment score and maximum period size were set to 80 and 500, respectively. Other parameters were set to default values. The Perl script repfind.pl from Vmatch 79 was used to find dispersed and palindromic repeats in which the minimal repeat size was 30 bp and the two repeat copies had at least 90% similarity (i.e., a Hamming distance of 3, -h 3). Then, two types of repeats were sorted by Vmatch with the -d (or -p, separately) -l 30 -h 3 -sort ia options. Complete IRa and IRb were excluded from the palindromic repeats. The Perl script MISA 80 was used to detect SSRs or microsatellites. The minimum numbers of repeats were 10, 5, 4, 3, 3, and 3 for mono-, di-, tri-, tetra-, penta-, and hexanucleotide repeats, respectively.
Sequence divergence analysis. The complete cp genomes were compared using the mVISTA program 81 with A. victorialis as a reference, because of the least numbers of pseudogenes and the second length cp genome. Then, the sequences were first aligned using MAFFT v7.394 82 and manually adjusted in MEGA v7.0.26 83 . Subsequently, a sliding window analysis was conducted to evaluate the nucleotide variability (Pi) of the cp genome using DnaSP v6.12.01 software 84 . The step size was set to 200 bp, and the window length was set to 600 bp. Variable and parsimony-informative base sites across the complete cp genomes and the LSC, SSC, and IR regions of the nine cp genomes were calculated. The p-distance and number of nucleotide substitutions among Allium cp genomes were calculated using MEGA v7.0.26 83 software.