Introduction

Artemisia is one of the largest and most widely distributed genera in the family Asteraceae. It is a heterogeneous genus consisting of more than 500 species distributed mainly in Europe, Asia and North America1,2. These species are perennial, biennial and annual herbs or small shrubs3,4. Their pungent odor and bitter taste are due to terpenoids and sesquiterpene lactones5. Some Artemisia species are cultivated as crops, whereas others are used in preparing tea, tonic, alcoholic beverages and medicines6. Various biochemically active secondary metabolites have been identified in Artemisia species, including essential oils, flavonoids, terpenoids, esters and other substances4,7,8,9. These metabolites are potential bioactive compounds for developing novel herbal drugs against multiple diseases, such as cancer10, malaria11, hepatitis, inflammation12 and fungal, bacterial13 and viral infections14. Researchers have extracted artemisinin from Artemisia annua and demonstrated its antimalarial effects15. Tu et al. converted artemisinin into a drug that has saved millions of lives worldwide16, work that was recognised with the 2015 Nobel Prize in Physiology or Medicine. These studies have confirmed the medicinal value of Artemisia species and their potential use in bio-exploration.

Artemisia giraldii Pamp. is one of the 186 Artemisia species found in China. It is an herbaceous plant distributed only in some areas of China (e.g., Henan, Hebei, Gansu, Ningxia, Shaanxi and Sichuan Provinces). Studies on A. giraldii are few and have mainly focused on its chemical composition, geographical distribution17 and community18. The main chemicals in Artemisia are terpenoids, flavonoids, coumarins, caffeoylquinic acids, sterols and acetylenes. Two flavones and several monoterpenoids and sesquiterpenoids have been isolated from the aerial parts of A. giraldii7,8. The two flavones, 4′,6,7-trihydroxy-3′,5′-dimethoxyflavone and 5′,5-dihydroxy-3′,4′,8-trimethoxyflavone, showed antibiotic activity against Escherichia coli, Sarcina lutea, Pseudomonas aeruginosa and Aspergillus flavus8. A monoterpene with antifungal activity, santolinylol, has also been isolated from A. giraldii19,20. The flowering parts of A. giraldii are rich in essential oils. Studies have shown that these essential oils exhibit strong fumigant activity against Sitophilus zeamais adults and possess substantial contact toxicity against maize weevils21.

Molecular breeding, genetic engineering and synthetic biology of Artemisia species have attracted considerable interest because they are critical to obtaining active materials efficiently. The first steps in such genetic studies include sequencing and analysing the nuclear and organelle genomes.

Mitochondria and plastids originate from bacterial endosymbionts22. The convergent evolution of mitochondria and plastids can be observed between distantly related species, within the same strain and even within the same cell. However, although mitochondrial and plastid genomes follow similar evolutionary paths, mitochondrial genomes have evolved much further23. In most photosynthetic plants, the mitochondrial genome (mitogenome) is more complex than the plastid genome and shows more severe gene loss, more extensive and refined forms of post-transcriptional editing and processing, more gene isoforms and a wider range of gene fragmentation24,25. However, the number of plastid genes is not larger than that of mitochondrial genes in some plants. In some non-photosynthetic plants, such as Hypopitys monotropa26 or Rhopalocnemis phalloides27, the plastomes show considerable gene loss and size reduction. The plastome size decreases up to 110–200 kb in autotrophic plants28. Co-extension/coexistence of mitochondrial and plastid genomes has been observed in various species, and in most cases, plastid DNA was overtaken by mitochondrial DNA29. Comparative analysis of the mitochondrial and chloroplast genomes of the same species can therefore reveal interactions between the two organelles.

The animal mitogenome is normally a circular, compact molecule about 17 kb long with little variation in size. It contains about 13 protein-coding genes (PCGs), two ribosomal RNA (rRNA) genes and 22 or 23 transfer RNA (tRNA) genes among bilaterians, with a few exceptions30. Although much larger mitochondrial genomes have occasionally been found, they are usually the product of duplicated portions of the mtDNA rather than variation in gene content31. Unlike the relatively simple animal mitochondrial genomes, the mitochondrial genomes of non-parasitic flowering plants are large and complex32,33,34. They exhibit a wide range of variation in size, sequence arrangement and repeat content, but the coding sequences are highly conserved (typically 24 core genes and 17 variable genes)35,36. The mitogenome has usually been represented as a monomeric circle with no mention of other forms37,38, as circular mapping is a convenient indicator of genome content and sequencing completion. Thus, the circular map appears in published plant mitochondrial genomes39. However, plant mitochondrial DNAs appear as linear and multi-branched molecules under electrophoresis and microscopy. Some studies have also proposed that plant mitochondrial genomes are not circular but are a collection of multiple forms, including circular, linear and branching molecules. Some of these molecules might represent intermediates of replication or recombination40. These multiple forms can also be called isomers of the genome. Isomers may arise because frequent recombination between repetitive sequences in the plant mitochondrial genome promotes genome rearrangement41,42, which is also indirectly indicated by the near-complete disruption of gene order among closely related species43,44. Cytoplasmic male sterility (CMS) is the most evident and widespread phenotype associated with plant mitogenomic rearrangements. CMS has long been of interest to plant breeders because the male-sterile phenotype contributes to hybrid seed production. Mining whole mitogenome sequences can complement the experimental approaches39,45. In particular, such sequences can reveal the origin, expression and evolution of CMS genes and the effect of CMS on mitogenome evolution.

A total of 7363 complete plastomes and 423 plant mitogenomes have been recorded in the GenBank Organelle Genome database (https://www.ncbi.nlm.nih.gov/genome/browse/) (last updated: December 20, 2021). The structural complexity of mitochondrial genomes makes their assembly considerably more difficult than that of plastomes, so only a few plant mitogenomes have been reported. Until now, no mitogenome in the Artemisia genus has been reported. This deficit has limited our understanding of the evolution and functioning of the mitochondria in this genus. Here, we assembled and annotated the plastome and mitogenome of A. giraldii for the first time. We analysed the gene content, repeat sequences and selection pressure of the A. giraldii mitogenome. In addition, we attempted to understand the evolutionary relationship between the plastomes and mitogenomes of Asteraceae species by constructing phylogenetic trees of 10 Asteraceae species. Lastly, we analysed the homologous sequences between the two organelle genomes. The results obtained from this study provide the first account of the mitogenome structure in Artemisia and shed light on the interaction between the mitogenome and plastome.

Materials and methods

Plant materials and DNA extraction and sequencing

We collected fresh A. giraldii Pamp. leaves from the Institute of Medicinal Plant Development (IMPLAD), Beijing, China. Then, the total genomic DNA (accession number: implad201910017) was extracted using a DNA extraction Kit (Tiangen Biotech, Beijing, China) and stored in a freezer at −80 °C. A DNA sequencing library was constructed with 1 μg of DNA by using a NEBNext library building kit and sequenced with a 2500 platform (Illumina, San Diego, CA, USA). Clean data were obtained with Trimmomatic software46 by removing low-quality sequences, namely sequences in which more than 50% of the bases had quality values (Q) of < 19 and sequences with more than 5% ‘N’ bases. The plant sample used for Illumina short‐read sequencing was subsequently used for Oxford Nanopore sequencing. Genomic DNA was prepared using the CTAB method and purified with a QIAGEN genomic kit (Cat# 13343, QIAGEN) according to the standard operating procedure provided by the manufacturer. About 700 ng of DNA was used in library construction and then sequenced on a Nanopore PromethION sequencer (Oxford Nanopore Technologies, UK) at the Genome Center of Grandomics (Wuhan, China). Raw reads obtained by Nanopore sequencing were filtered to remove reads with Q of < 7.
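The short-read filtering rule described above can be illustrated with a minimal Python sketch. It is not the Trimmomatic implementation; the function name passes_filter, the Phred+33 quality encoding and the choice to reject a read when either threshold is exceeded are assumptions made only for illustration.

```python
def passes_filter(seq, qual, min_q=19, max_lowq_frac=0.5,
                  max_n_frac=0.05, offset=33):
    """Return True if a read survives the two filters described in the text.

    Assumptions (not from the paper): Phred+33 quality strings, and a read
    is rejected when EITHER threshold is exceeded.
    """
    if not seq or len(seq) != len(qual):
        return False
    # Count bases whose Phred quality is below the cutoff (Q < 19)
    low_q = sum(1 for ch in qual if ord(ch) - offset < min_q)
    # Count ambiguous 'N' bases
    n_bases = sum(1 for base in seq if base in "Nn")
    if low_q / len(seq) > max_lowq_frac:
        return False
    if n_bases / len(seq) > max_n_frac:
        return False
    return True


# Toy reads: 'I' encodes Q40 and '#' encodes Q2 under Phred+33
print(passes_filter("ACGT" * 5, "I" * 20))        # high-quality read
print(passes_filter("ACGT" * 5, "#" * 20))        # low-quality read
print(passes_filter("NN" + "A" * 18, "I" * 20))   # 10% N bases
```

The thresholds are exposed as keyword arguments so that the same sketch can be reused with other cutoffs.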

Genome assembly and annotation

GetOrganelle47 was used in assembling the organelle genomes. We first used the Illumina data alone to assemble the plastome. The parameters applied for the plastome were ‘-R 15 -k 21,45,65,85,105 -F embplant_pt’. Then, we applied a hybrid strategy combining Illumina and Nanopore reads to assemble the mitogenome. GetOrganelle was used in extracting mitochondrial genome reads from Illumina whole-genome sequence (WGS) data. We then assembled the extracted reads into a unitig graph. All the ‘edges’ of the unitig graph had the same coverage depth, suggesting the absence of plastid and nuclear sequences, which tend to show significantly higher or lower coverage depths. The unitig graph contained multiple double-bifurcation structures (DBSs, ‘>=<’) resulting from the presence of repeat sequences in the genome. To resolve the sequence paths around these DBSs, we constructed all possible sequences around each DBS and aligned them with the Nanopore reads by using the minimap2 tool48. For each DBS, we selected the sequence path with the largest number of mapped Nanopore reads as the dominant sequence path. Finally, we identified a cyclic path on the unitig graph covering all the ‘edges’. This path corresponded to a circular DNA sequence, which was considered the mitogenome.
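The selection of a dominant path at each double-bifurcation structure can be sketched as follows. This is only an illustration of the decision rule, not the script used in this study; the edge names and the read counts are hypothetical, and in practice the counts would come from long-read alignments.

```python
from itertools import product


def candidate_paths(incoming, repeat, outgoing):
    """Enumerate every entry-repeat-exit path through one DBS."""
    return [(i, repeat, o) for i, o in product(incoming, outgoing)]


def dominant_path(read_support):
    """Return the candidate path supported by the most long reads."""
    if not read_support:
        raise ValueError("no candidate paths supplied")
    return max(read_support, key=read_support.get)


# One DBS: two edges enter the repeat 'r1' and two edges leave it
paths = candidate_paths(["e1", "e2"], "r1", ["e3", "e4"])

# Hypothetical numbers of Nanopore reads spanning each candidate path
support = dict(zip(paths, [42, 3, 5, 39]))

best = dominant_path(support)
print(best)
```

With these hypothetical counts, the paths e1-r1-e3 and e2-r1-e4 are well supported, whereas the two alternative pairings are supported by only a few reads.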

The plastome was annotated using CPGAVAS249, and the reference genome was Chrysanthemum indicum (NC_020320.1)50. The diagrams of cis-splicing and trans-splicing genes in the plastome were created using CPGview-RSG (http://www.herbalgenomics.org/cpgview). The mitogenome was annotated using MGAVAS (http://www.1kmpg.cn/mgavas) and GeSeq (https://chlorobox.mpimp-golm.mpg.de/geseq.html)51, and its reference genome was C. indicum (MH716014.1)52. To confirm the annotations, we used tRNAscan-SE53 with default settings. We used Apollo54 to manually correct the annotation problems and OrganellarGenomeDRAW (OGDRAW) (v1.3.1)55 to draw a genome map. Then, we submitted the organelle genome sequences and annotations to GenBank by BankIt (https://www.ncbi.nlm.nih.gov/WebSub/) and obtained accession numbers OK128342 for the plastome and NC_064134.1 for the mitogenome.

Homology sequence analysis between plastid and mitochondrion

Sequence similarity comparison between the plastome (OK128342) and mitogenome (NC_064134.1) was carried out to identify homologous sequences between the two organelles. BLASTN56 was used with an e-value cutoff of 1e-5. The final results were visualised using the Circos package implemented in TBtools57,58.
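As an illustration of how such hits can be summarised, the following minimal sketch filters BLASTN tabular output by e-value and measures how much of the subject genome is covered. It assumes the standard 12-column tabular format (-outfmt 6) with the plastome as query and the mitogenome as subject; the function names and the toy hit lines are hypothetical.

```python
def parse_hits(lines, max_evalue=1e-5):
    """Collect subject intervals of hits that pass the e-value cutoff."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Columns 9 and 10 are sstart/send; column 11 is the e-value
        if float(fields[10]) <= max_evalue:
            start, end = sorted((int(fields[8]), int(fields[9])))
            hits.append((start, end))
    return hits


def covered_length(intervals):
    """Total number of bases covered after merging overlapping intervals."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def homologous_fraction(lines, genome_length, max_evalue=1e-5):
    """Fraction of the subject genome covered by retained hits."""
    return covered_length(parse_hits(lines, max_evalue)) / genome_length


# Toy hits (hypothetical): two overlapping hits and one weak hit
toy = [
    "pt\tmt\t98.0\t100\t2\t0\t1\t100\t500\t401\t1e-40\t180",
    "pt\tmt\t90.0\t50\t5\t0\t200\t249\t450\t499\t1e-10\t60",
    "pt\tmt\t80.0\t30\t6\t0\t300\t329\t900\t929\t0.01\t20",
]
print(covered_length(parse_hits(toy)))
print(homologous_fraction(toy, 1000))
```

Merging overlapping intervals before summing avoids counting the same mitogenome bases twice when several plastid fragments match the same region.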

Repeat elements analysis

The microsatellite sequence repeats were identified by using Misa59 (https://webblast.ipk-gatersleben.de/misa/) with the parameters ‘1-10 2-5 3-4 4-3 5-3 6-3’. The tandem repeats were identified using TRF60 with the following parameters: ‘2 7 7 80 10 50 500 -f -d -m’. The dispersed repeats were identified using the REPuter web server61 (https://bibiserv.cebitec.uni-bielefeld.de/reputer/) with the following parameters: hamming distance, 3; maximum computed repeats, 5000; and minimal repeat size, 30. The results were filtered at an e-value of 1e−4. Visualisation was conducted according to the procedure for homologous sequence analysis.
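The Misa parameter string gives, for each repeat unit length, the minimum number of repeats (for example, ‘1-10’ means mononucleotide runs of at least 10 units). A minimal regular-expression sketch of this rule is shown below. It is not the Misa program itself; the function names are hypothetical, and the exclusion of units that are themselves repeats of a shorter motif is an assumption about the intended behaviour.

```python
import re

# Unit length -> minimum number of repeats ('1-10 2-5 3-4 4-3 5-3 6-3')
MISA_THRESHOLDS = {1: 10, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}


def is_redundant(unit):
    """True if the unit is itself a repeat of a shorter motif (e.g. 'ATAT')."""
    n = len(unit)
    return any(n % k == 0 and unit == unit[:k] * (n // k)
               for k in range(1, n))


def find_ssrs(seq, thresholds=MISA_THRESHOLDS):
    """Return (1-based start, unit, repeat count) for each perfect SSR."""
    seq = seq.upper()
    found = []
    for unit_len, min_reps in sorted(thresholds.items()):
        # A unit followed by at least (min_reps - 1) further copies
        pattern = re.compile(r"(([ACGT]{%d})\2{%d,})"
                             % (unit_len, min_reps - 1))
        for match in pattern.finditer(seq):
            unit = match.group(2)
            if is_redundant(unit):
                continue
            found.append((match.start() + 1, unit,
                          len(match.group(1)) // unit_len))
    return sorted(found)


# Toy sequence with one mono-, one di- and one tetranucleotide SSR
toy = "GC" + "A" * 12 + "GC" + "AT" * 6 + "G" + "AGCT" * 3 + "C"
for ssr in find_ssrs(toy):
    print(ssr)
```

The redundancy check prevents a run such as twelve A bases from also being reported as a di-, tri- or tetranucleotide repeat.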

Phylogenetic inference analysis

The plastome and mitogenome of A. giraldii, combined with those of nine other Asteraceae species, were used in phylogenetic analysis. Two species of the genus Solanum were selected as outgroup taxa. The common genes of the 12 species were extracted using Phylosuite (v1.1.16)62. From the plastome, we extracted the coding sequences from 67 common genes (atpA, atpB, atpE, atpF, atpH, ccsA, cemA, matK, ndhA, ndhB, ndhC, ndhD, ndhE, ndhF, ndhG, ndhH, ndhI, ndhJ, ndhK, petA, petB, petD, petG, petL, petN, psaA, psaB, psaC, psaI, psaJ, psbA, psbD, psbE, psbF, psbH, psbI, psbJ, psbK, psbM, psbN, psbT, rbcL, rpl2, rpl14, rpl16, rpl20, rpl22, rpl32, rpl33, rpl36, rpoA, rpoB, rpoC1, rpoC2, rps2, rps3, rps4, rps7, rps8, rps11, rps12, rps15, rps16, rps18, rps19, ycf3 and ycf4) from 10 Asteraceae species and two outgroup taxa for phylogenetic analysis. From the mitogenome, we extracted 29 orthologous mitochondrial genes (atp1, atp4, atp6, atp8, atp9, cox1, cox2, cox3, ccmB, ccmC, ccmFc, ccmFn, cytb, matR, mttB, nad1, nad2, nad3, nad4, nad4L, nad5, nad6, nad7, nad9, rpl10, rps3, rps4, rps12 and rps13) from the same set of species for analysis. Then, we aligned the coding sequences with MAFFT (v7)63 and concatenated them with Phylosuite (v1.1.16). We used Gblocks with default parameters to optimise the alignment of the concatenated sequences64. The best substitution model was selected using jModelTest (v2.1.0)67 according to the Bayesian information criterion. TVM + G was found to be the best model for the plastome and mitogenome analyses. The phylogenetic tree was built using the maximum-likelihood method implemented in IQ-TREE (v2)65. Bootstrap analysis was performed using UFBoot with 1000 replicates65. We also performed Bayesian inference (BI) analysis using MrBayes (v3.2.7)68. The ML and BI trees were visualised using iTOL (v5; https://itol.embl.de/)66.
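The concatenation step can be illustrated with a minimal sketch that joins per-gene alignments into one supermatrix and records the coordinates of each gene. This is not the Phylosuite implementation; the function name, the data layout (gene name mapped to a dictionary of taxon and aligned sequence) and the toy sequences are assumptions for illustration.

```python
def concatenate(alignments, gene_order):
    """Join per-gene alignments into a supermatrix.

    alignments: {gene: {taxon: aligned_sequence}}
    Returns (supermatrix, partitions), where partitions holds
    (gene, start, end) with 1-based inclusive coordinates.
    """
    # Keep only taxa present in every gene, as for 'common genes'
    taxa = sorted(set.intersection(*(set(alignments[g]) for g in gene_order)))
    supermatrix = {taxon: "" for taxon in taxa}
    partitions = []
    start = 1
    for gene in gene_order:
        lengths = {len(alignments[gene][taxon]) for taxon in taxa}
        if len(lengths) != 1:
            raise ValueError("unequal alignment lengths in " + gene)
        length = lengths.pop()
        for taxon in taxa:
            supermatrix[taxon] += alignments[gene][taxon]
        partitions.append((gene, start, start + length - 1))
        start += length
    return supermatrix, partitions


# Toy alignments for two genes and two taxa (hypothetical sequences)
toy = {
    "atp1": {"sp1": "ATG-", "sp2": "ATGA"},
    "cox1": {"sp1": "CC", "sp2": "CT"},
}
matrix, parts = concatenate(toy, ["atp1", "cox1"])
print(matrix)
print(parts)
```

The partition coordinates are useful when a separate substitution model is to be fitted to each gene.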

Selective pressure analysis of A. giraldii mitogenome

We used EasyCodeML (v1.4) software69 to conduct the selective pressure analysis of 28 protein-coding genes in the mitogenome. The running model was ‘Preset (Nested Models)’. The site model in EasyCodeML can be used in identifying positively selected sites in a multiple-sequence alignment70. The required inputs for analysing selection are aligned sequences in PAML format and a tree file in Newick format. Firstly, we aligned each gene from the 10 species with MAFFT (v7)63 and converted the alignment into PAML format by using the ‘Seqformat Convertor’ tool in EasyCodeML (v1.4). Then, we used IQ-TREE (v2)65 to generate a tree file in Newick format. Finally, we ran CodeML with the following parameters: ‘nt = 0 and icode = 0’. On the basis of the lnL and np values of the null models (M0, M1a, M7 and M8a) and alternative models (M3, M2a and M8), the likelihood ratio test (LRT) p-value of each PCG was calculated. Then, the p-values were adjusted using the Benjamini–Hochberg correction method71. Genes with adjusted p-values of < 0.05 were considered positively selected.
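The likelihood ratio test and the p-value adjustment can be sketched with the Python standard library alone. The sketch assumes that the test statistic 2(lnL_alt − lnL_null) is compared with a chi-square distribution whose degrees of freedom equal the difference in np between the two models; this is the usual convention but is stated here as an assumption, and the function names and toy values are hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square distribution (closed forms)."""
    if df <= 0:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # exp(-x/2) * sum_{k < df/2} (x/2)^k / k!
        term = total = 1.0
        for k in range(1, df // 2):
            term *= (x / 2) / k
            total += term
        return math.exp(-x / 2) * total
    # Odd df: erfc term plus a finite series
    term = math.sqrt(2 * x / math.pi)
    total = 0.0
    for k in range((df - 1) // 2):
        if k > 0:
            term *= x / (2 * k + 1)
        total += term
    return math.erfc(math.sqrt(x / 2)) + math.exp(-x / 2) * total


def lrt_pvalue(lnl_null, lnl_alt, np_null, np_alt):
    """p-value of the LRT between a null and an alternative model."""
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    return chi2_sf(stat, np_alt - np_null)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy example (hypothetical lnL and np values for one gene)
p = lrt_pvalue(lnl_null=-1234.5, lnl_alt=-1230.0, np_null=20, np_alt=22)
print(p)
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Adjusting the p-values across all tested genes controls the false discovery rate when many genes are tested at once.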

Molecular marker development

To discover universal primers that can be used in distinguishing the Artemisia species, we downloaded the 17 plastome sequences of Artemisia species from GenBank. They were analysed using ecoPrimers72 with the following parameters: ‘-l 300 -L 600 -e 0 -3 2 -t species -U -f -O 25’. Here, ‘-l 300’ specified the minimum barcode length as 300, excluding primers. ‘-L 600’ specified the maximum barcode length as 600, excluding primers. ‘-e 0’ specified the maximum number of mismatches allowed per primer as 0. ‘-3 2’ specified the number of nucleotides on the 3′ end of the primers as 2, and these primers should have a strict match with their target sequences. ‘-t species’ specified the taxonomic level used for evaluating barcodes and primers as ‘species’. ‘-U’ meant that no multi match of a primer on the same sequence record is allowed. ‘-f’ indicated the removal of data mining step during strict primer identification. ‘-O 25’ specified the primer length to be 25. A custom script was used to extract the regions adjacent to the identified DNA barcode region for designing PCR primers.

Hypervariable region analysis

To identify the hypervariable regions among the 18 Artemisia species, we wrote a custom script to extract the intergenic spacer regions (IGS) from the GenBank files of the 18 plastomes. Firstly, we extracted the IGS sequences using extractseq. Then, we aligned the extracted sequences using the clustalw2 program73 with the options ‘-type=DNA -gapopen=10 -gapext=2’. Finally, we calculated the genetic distance of the intergenic regions using the Kimura two-parameter (K2p) evolution model implemented in the distmat program from the EMBOSS package74 with the parameter ‘-nucmethod 2’. Fourteen hypervariable IGS were identified (Fig. 6). To verify whether these molecular markers can distinguish the 18 Artemisia species, we extracted the top three most variable IGS regions from the 18 Artemisia species for alignment.
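For reference, the K2p distance between two aligned sequences is d = −(1/2) ln[(1 − 2P − Q) √(1 − 2Q)], where P and Q are the proportions of transitional and transversional differences. A minimal sketch is given below. It is not the distmat implementation; the function name is hypothetical, positions containing gaps or ambiguous bases are simply skipped (an assumption), and the value is returned as substitutions per site, so the scaling may differ from that of the distmat output.

```python
import math

PURINES = {"A", "G"}


def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        # Skip gaps and ambiguous bases (assumption of this sketch)
        if a not in "ACGT" or b not in "ACGT":
            continue
        sites += 1
        if a == b:
            continue
        if (a in PURINES) == (b in PURINES):
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    if sites == 0:
        raise ValueError("no comparable sites")
    p = transitions / sites
    q = transversions / sites
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


# One transition in ten sites, then one transversion in ten sites
print(k2p_distance("AAAAAAAAAA", "GAAAAAAAAA"))
print(k2p_distance("AAAAAAAAAA", "CAAAAAAAAA"))
```

Transitions and transversions are counted separately because the model allows them to occur at different rates.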

Ethics approval and consent to participate

We collected fresh leaf materials from A. giraldii for this study. No specific permits were required from the local government for the collection. In addition, we conducted the study in compliance with relevant institutional, national and international guidelines and legislation. We prepared the voucher specimens and deposited them in the Institute of Medicinal Plant Development (Beijing, China) with the accession number implad201910017.

Results

DNA sequencing, genome assembly and validation

In the Illumina sequencing data, a total of 21,579,647 sequences were generated, and the total number of bases was 3,236,947,050. The average read length was 150 bp. In the Nanopore sequencing data, a total of 1,800,259 reads (10.225 Gb) were obtained, of which 1,389,001 reads (8.227 Gb) had Q of > 7 and were used in subsequent analysis. The average length of the remaining reads was 5.923 kb, the N50 was 14.074 kb and the longest read was 114.470 kb. We used two strategies to assemble the plastome. In the first strategy, we used Illumina data alone. In the second strategy, we used the Illumina and Nanopore reads. The assembled results were identical except that the small single-copy (SSC) region was inverted between the two assemblies (Supplementary Fig. S1A). In the mitogenome assembly, we used Illumina and Nanopore reads.
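The N50 reported above is the read length at which the cumulative length of the reads, sorted from longest to shortest, first reaches half of the total number of bases. A minimal sketch of this calculation follows; the function name and the toy read lengths are hypothetical.

```python
def n50(lengths):
    """Length L such that reads of length >= L hold at least half the bases."""
    if not lengths:
        raise ValueError("no read lengths supplied")
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Toy read lengths (hypothetical): total 20, half 10, cumulative 6 then 11
print(n50([2, 3, 4, 5, 6]))
```

The same data also allow a quick check of the Illumina statistics: 3,236,947,050 bases divided by 21,579,647 reads gives an average read length of 150 bp.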

We mapped the Illumina reads to the assembly results to obtain the coverage depth of the plastome and examine the quality of the assembly (Supplementary Fig. S2). To determine the coverage depth of the mitogenome, we mapped the Illumina reads to the hybrid assembly results (Supplementary Fig. S3). The average coverage depth was 121× for the mitogenome and 430× for the plastome. For locations with low coverage depths in the mitogenome and plastome, we used Tablet software75 to visualise the read coverage. All low-coverage locations were spanned by reads (Supplementary Figs. S2 and S3). More than 30 reads covered each low-coverage location in the plastome, and more than 10 reads covered each low-coverage location in the mitogenome. We used Bandage76 to visualise the structure of the A. giraldii plastome (Supplementary Fig. S4A) and mitogenome (Supplementary Fig. S4B). The plastome was a typical circular sequence containing a large single-copy (LSC) region, a pair of identical inverted repeats (IRs) and an SSC region (Supplementary Fig. S4A).

The unitig graph of the mitogenome showed a branched polymeric structure (Supplementary Fig. S4B). Different contigs (Supplementary Fig. S4B, left side) were linked to form a master chromosome (Supplementary Fig. S4B, right side). The master chromosome can undergo rearrangement through repeat-mediated recombination, generating chromosomes with different arrangements, called isomers40. We manually removed non-mitochondrial nodes from the graph according to the stratified coverage depth, and the repeat paths were resolved by aligning with the Nanopore long reads. Finally, a circular mitochondrial molecule was obtained (Supplementary Fig. S4). The master chromosome encoded 54 genes: 32 PCGs, 3 rRNAs and 21 tRNAs. The quantities were consistent with those found in other Asteraceae species.

General features of the A. giraldii organelle genomes

To understand the characteristics of the mitogenome and plastome of A. giraldii, we analysed their general features. The entire length of the plastome was 151,072 bp, and it was divided into four regions: an LSC region of 82,838 bp, an SSC region of 18,316 bp and a pair of identical 24,959 bp IRs (Fig. 1A). A total of 109 unique genes were found in the A. giraldii plastome: 78 PCGs, 27 tRNA genes, and 4 rRNA genes (Supplementary Table S1). Among these genes, 19 genes (rpl16, petD, petB, trnV-UAC, trnL-UAA, trnG-UCC, atpF, rpoC1, rps16, trnK-UUU and rpl2) had one intron, and two genes (clpP, ycf3) had two introns (Supplementary Table S2). Eleven cis-splicing genes (rpl16, petD, petB, clpP, ycf3, atpF, rpoC1, rps16, rpl2, ndhB and ndhA) were found in the A. giraldii plastome (Supplementary Fig. S5), and all these genes were PCGs. The cis-splicing genes rpl2 and ndhB had two introns. rps12 was the only trans-splicing gene identified (Supplementary Fig. S6).

Figure 1

The circular maps of the organelle genomes of A. giraldii. (A) The circular map of the plastome. (B) The circular map of the mitogenome. The functions of the different colored genes on the map are shown on the left. The dark gray region in the inner circle indicates the GC content. The circular maps of two organelle genomes were drawn by Geseq (https://chlorobox.mpimp-golm.mpg.de/geseq.html).

The total length of PCGs in A. giraldii plastome was 78,009 bp, representing 51.64% of the whole length of the plastome sequence. By contrast, the size of the rRNA was 9046 bp, and the size of the tRNA was 2693 bp, representing 5.99% and 1.78% of the total length of the A. giraldii plastome sequence, respectively. The GC content analysis showed that the overall GC content was 37.47%. In particular, the GC content for the protein-coding regions, rRNA genes and tRNA genes was 37.78%, 55.10% and 52.73%, respectively. The GC content in the LSC, SSC and IR regions was 35.56%, 30.78% and 43.09%, respectively.

The total length of the A. giraldii mitogenome was 194,298 bp. The base composition of the entire mitogenome was A (27.26%), G (22.75%), T (27.08%) and C (22.90%). The overall GC content was 45.66%. We annotated 32 PCGs in the mitogenome (Fig. 1B). According to their functions, these 32 genes can be divided into 10 classes: ATP synthase (atp1, atp4, atp6, atp8 and atp9), cytochrome c biogenesis (ccmB, ccmC, ccmFc and ccmFn), ubiquinol cytochrome c reductase (cob), cytochrome c oxidase (cox1, cox2 and cox3), maturases (matR), transport membrane protein (mttB), NADH dehydrogenase (nad1, nad2, nad3, nad4, nad4L, nad5, nad6, nad7 and nad9), large subunit of ribosome (rpl5 and rpl10), small subunit of ribosome (rps1, rps3, rps4, rps12 and rps13) and succinate dehydrogenase (sdh4; Table 1).
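Base composition and GC content of the kind reported above can be computed with a few lines of Python. The sketch below is illustrative only; the function name is hypothetical, and bases other than A, C, G and T are excluded from the denominator, which is an assumption.

```python
from collections import Counter


def base_composition(seq):
    """Percentages of A, C, G, T and of G + C in a nucleotide sequence."""
    counts = Counter(seq.upper())
    total = sum(counts[base] for base in "ACGT")
    if total == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    percent = {base: 100.0 * counts[base] / total for base in "ACGT"}
    percent["GC"] = percent["G"] + percent["C"]
    return percent


# Toy sequence; the two N bases are ignored
comp = base_composition("AACCGGTTNN")
print(comp)
```

For the percentages given in the text, G + C is 22.75% + 22.90% = 45.65%, which agrees with the stated GC content of 45.66% up to rounding.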

Table 1 Gene composition in the A. giraldii mitogenome.

Comparison of genomic features with the other nine Asteraceae mitogenomes

Angiosperm mitogenomes vary greatly in genome structure, gene content and constitution. Variations in mitogenome size can be explained mostly by differences in the length of intergenic regions25. We compared the length, GC content and PCG number of the A. giraldii mitogenome with those of the mitogenomes from nine other published Asteraceae species: Lactuca sativa, Diplostephium hartwegii, Chrysanthemum boreale, C. indicum, Ageratum conyzoides, Helianthus grosseserratus, Helianthus annuus, Helianthus tuberosus and Helianthus strumosus (Table 2). The length of these 10 mitogenomes ranged from 194,298 to 363,324 bp. The largest mitogenome was that of L. sativa (363,324 bp), and the smallest was that of A. giraldii (194,298 bp) from this study. The A. giraldii mitogenome was similar in length to those of the two Chrysanthemum species, and all three were relatively small among the Asteraceae species. The GC content was similar among the 10 mitogenomes, ranging from 44.89 to 45.66%. Meanwhile, we collated the number of PCGs in the 10 mitogenomes. The number of genes ranged from 24 in H. annuus to 35 in D. hartwegii. We determined the collinearity between A. giraldii and the nine Asteraceae species by using the MAFFT (v7) online service (https://mafft.cbrc.jp/alignment/server/)77 to identify rearrangements among them. Using A. giraldii as a reference, dotplot analysis showed synteny fragments across all species (Fig. 2). C. indicum and C. boreale had larger synteny fragments than the other seven Asteraceae species. The largest fragments were about 27 kb in C. indicum and 42 kb in C. boreale. By contrast, the synteny fragments shared with the other seven Asteraceae species were smaller.

Table 2 Comparison of mitogenome and plastome in terms of size, GC content and number of PCGs in 10 Asteraceae plants.
Figure 2

The dotplot graphs of collinearity between the mitogenomes of the A. giraldii and nine Asteraceae species. The vertical axis represents the A. giraldii mitogenome. The horizontal axis represents the nine Asteraceae mitogenomes, respectively. The red and blue lines showed the homologous regions in the forward and reverse direction between the A. giraldii and nine Asteraceae species, respectively. These dotplot graphs were drawn by MAFFT online service (https://mafft.cbrc.jp/alignment/server/).

Repeat sequence analysis

In addition to differences in intergenic regions, diversity in mitogenome size can be attributed to a large number of repeat sequences and foreign fragments43,78. Therefore, we analysed three common types of repeated sequences. Microsatellites (simple sequence repeats, SSRs), which are tandem repeats with unit lengths of 1–6 bp, are abundant in the genomes of higher organisms and usually show high levels of polymorphism79. Therefore, they are generally used as molecular markers for identifying similar species80. SSRs are classified into mono-, di-, tri-, tetra-, penta- and hexanucleotide repeats according to the length of their major repeat units81. We identified 36 SSRs in the plastid sequence and 51 SSRs in the mitochondrial sequence (Fig. 3, Supplementary Tables S3, S4). The most abundant SSRs in the plastome were single-nucleotide SSRs, including 19(A) and 12(T), accounting for 79.49% of the total SSRs. However, the SSRs in the A. giraldii mitogenome were dominated by tetranucleotide repeats, which accounted for 43.14% of all SSRs. The types of SSRs in the mitogenome were more evenly represented than those in the plastome.

Figure 3

The repeat sequences of the A. giraldii organelle genomes. (A) The repeat sequences in the plastome. (B) The repeat sequences in the mitogenome. The first circle shows the dispersed repeats connected with green, orange, and purple arcs from the center going outward. The green, orange, and purple arcs represent the forward repeats, palindromic repeats, and reverse repeats, respectively. The next circle shows the tandem repeats as short bars. The third circle shows the microsatellite sequences as short bars. The scale is shown on the outermost circle, with intervals of 20 kb. The repeat sequences of the A. giraldii organelle genomes were visualized using the Circos package implemented in the TBtools.

Tandem repeat sequences exist in the DNA of all organisms whose genomes have been sequenced. These sequences consist of multiple contiguous repeat units and exhibit extremely high mutation rates in eukaryotes and prokaryotes because they tend to gain or lose repeat units82. We identified 23 tandem repeats in the plastome and 15 in the mitogenome (Supplementary Tables S5, S6). The repeats can be further tested for their suitability as DNA fingerprinting markers.

In the A. giraldii plastome, we identified 38 dispersed repeats: 18 forward repeats, 19 palindromic repeats and 1 reverse repeat (Supplementary Table S7). All the dispersed repeats in the plastome were shorter than 100 bp; the longest was 60 bp, and the shortest was 30 bp. The number of dispersed repeats in the mitogenome was larger than that in the plastome. In the mitogenome, we found 135 dispersed repeats comprising 85 forward repeats, 49 palindromic repeats and 1 reverse repeat. They accounted for 62.96%, 36.30% and 0.74% of all dispersed repeats, respectively (Supplementary Table S8). The length of the dispersed repeat sequences ranged from 30 to 248 bp, but only 17 were longer than 100 bp.

Analysis of homologous sequences between two organelles

The transfer of mitochondrial and plastid DNAs to the nucleus has been considered a part of ongoing genome evolution and influences eukaryote evolution83,84. DNA transfer occurs not only from the organelles to the nucleus but also from the plastid to the mitochondrion85,86. For example, the plastid gene rbcL has been transferred to the mitogenome numerous times during angiosperm evolution, and all evaluated sequences are pseudogenes87. To investigate whether plastid DNA has been transferred to mitochondrial DNA, we used BLASTN56 to identify potential homologous sequences between the plastome and mitogenome in A. giraldii with a cutoff e-value of 1e-05. Nine homologous DNA fragments were found between the two organelle genomes (Fig. 4, Supplementary Table S9). The total length of the nine fragments was 4806 bp, accounting for 2.47% of the whole mitogenome. The longest fragment was 888 bp in the mitogenome, and the shortest was 79 bp. The locations of the nine homologous fragments in the mitochondrial and plastid genomes are shown in Supplementary Table S9.

Figure 4

The homologous DNA sequences between the plastome and mitogenome of A. giraldii. The homologous DNA fragments were identified by comparing the plastome and the mitogenome sequences using the program BLASTn with the e-value cutoff of 1e-05. The purple and green circles represent the mitogenome (mtDNA) and plastome (cpDNA), respectively, and the inner blue arcs show the homologous DNA fragments. The scale is shown on the outermost circle, with intervals of 20 kb. The homologous sequences between the A. giraldii organelle genomes were visualized using the Circos package implemented in the TBtools.

Phylogenetic inference analysis

We constructed phylogenetic trees with the concatenated PCG sequences, using the maximum likelihood (ML) and BI methods (Fig. 5). The phylogenetic trees constructed with plastome and mitogenome sequences had minor differences in topological structure. In both trees, the 12 species were first divided into two main clades: a large clade composed of the 10 Asteraceae species and a small clade composed of the two outgroup species. A. giraldii was closely related to C. indicum and C. boreale in the two trees. The first difference concerned the Helianthus species. In the mitochondrial genome tree, H. grosseserratus and H. annuus were clustered on one branch, and H. strumosus and H. tuberosus were clustered on another branch. However, in the plastome tree, H. annuus and H. tuberosus were separated into different branches, whereas H. grosseserratus and H. strumosus were clustered in a clade. The second difference was that L. sativa was located in different positions in the two trees. In the plastid tree, L. sativa was located in the outermost clade formed by the Asteraceae family. In the mitochondrial tree, L. sativa was located within the clade formed by the Asteraceae species (Fig. 5).

Figure 5

The phylogenetic relationships among A. giraldii and nine Asteraceae species using the maximum likelihood (ML) and Bayesian Inference (BI) methods. The sequence obtained from this study is highlighted in bold. The left is the phylogenetic tree constructed based on the coding sequences of 67 PCGs from the plastome. In contrast, on the right is the phylogenetic tree based on the coding sequences of 29 PCGs from the mitogenome. The numbers indicate the bootstrap values for the ML tree and Bayesian inference (BI) posterior probabilities for the BI tree, separated with a slash. The GenBank accession numbers of the plastomes and mitogenomes are shown after the Latin name of the related species, respectively. The length of the branch corresponds to the frequency of base substitutions. The phylogenetic trees constructed by the maximum likelihood (ML) and Bayesian Inference (BI) methods were visualized by iTOL (v5) (https://itol.embl.de/).

Selective pressure analysis of A. giraldii mitogenomic genes

To determine which genes are subject to positive selection, we performed a likelihood ratio test (LRT) for each of the 28 protein-coding genes in the mitogenome, calculating the p-value from the log-likelihood (lnL) and number of parameters (np) of the null and alternative models. The LRT p-values were then adjusted (Supplementary Table S10). The detailed results can be found in Supplementary Table S11. The adjusted p-values of ccmFc, nad1, nad6, atp9, atp1 and rps12 were below 0.05, suggesting that these six genes are subject to positive selection.
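
As an illustration of how an LRT p-value can be derived from lnL and np values, the following minimal sketch computes the test statistic 2 × (lnL_alt − lnL_null), compares it with a chi-square distribution whose degrees of freedom equal the difference in np, and then adjusts the p-values. The adjustment method is not stated in the text; the Benjamini-Hochberg procedure is used here purely as an assumption. The gene names and lnL/np values in the example are hypothetical and are not taken from the supplementary tables.

```python
import math


def chi2_sf(stat, df):
    """Upper-tail probability of a chi-square distribution with integer df >= 1."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if stat <= 0:
        return 1.0
    y = stat / 2.0
    # Closed-form starting point, then the recurrence
    # Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1).
    if df % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(y))
    else:
        a, q = 1.0, math.exp(-y)
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return min(q, 1.0)


def lrt_pvalue(lnl_null, np_null, lnl_alt, np_alt):
    """p-value of the likelihood ratio test for nested models."""
    stat = 2.0 * (lnl_alt - lnl_null)
    return chi2_sf(stat, np_alt - np_null)


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjustment (an assumed choice of method)."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical (lnL_null, np_null, lnL_alt, np_alt) values for illustration only.
fits = {
    "geneA": (-1000.0, 20, -995.0, 21),
    "geneB": (-2000.0, 20, -1999.5, 21),
}
raw = [lrt_pvalue(*fits[g]) for g in fits]
for gene, p_adj in zip(fits, bh_adjust(raw)):
    print(gene, f"{p_adj:.4g}", "candidate" if p_adj < 0.05 else "not significant")
```

The chi-square tail is computed with the standard library so that the sketch runs without extra dependencies; a statistics package would normally be used instead.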

Molecular marker development

Based on the plastome sequences of 18 Artemisia species, we identified one candidate molecular marker for distinguishing among them (Supplementary Table S12). The marker consisted of a pair of highly conserved regions that can be used for primer design. The region amplified by the primer pair contained one or more SNP and indel sites that can be used to distinguish among the 18 Artemisia species. However, the amplified region was about 30 kb long, which is too long for practical use.

Analysis of hypervariable regions

A total of 14 intergenic spacers (IGSs) were hypervariable regions (Fig. 6). The top three regions, ndhG-ndhI, ccsA-ndhD and rpl32-trnL-UAG, had K2P values of 1.50, 1.22 and 1.06, respectively. We first extracted and aligned these three regions (Supplementary Fig. S7). However, the only two variant sites in the ccsA-ndhD region were also present in the rpl32-trnL-UAG region. Hence, we selected two regions, ndhG-ndhI and rpl32-trnL-UAG, for molecular marker development. The variant sites in these two regions, comprising 11 SNPs and six indels, can be used to completely distinguish among the 18 species (Supplementary Fig. S7). As indicated in Supplementary Fig. S7A, SNPs 1–6 can be used to distinguish Artemisia hallaisanensis, Artemisia absinthium var. calcigena, Artemisia frigida, Artemisia maritima, Artemisia argyi and Artemisia fukudo, each from the other 17 species. Indels 1–3 can be used to discriminate Artemisia freyniana, Artemisia lactiflora and Artemisia gmelinii. As shown in Supplementary Fig. S7B, SNPs 7–11 can be used to identify A. frigida, Artemisia capillaris, Artemisia stolonifera, Artemisia montana and Artemisia scoparia. Indels 4 and 5 can be used to distinguish Artemisia selengensis and A. annua, each from the other 17 species. Once these 16 species have been distinguished, the remaining two species, Artemisia ordosica and Artemisia tangutica, can be distinguished from each other by using indel 6.
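
For readers unfamiliar with the K2P (Kimura two-parameter) distance used to rank these regions, the sketch below computes it for one pair of aligned sequences from the proportions of transitions (P) and transversions (Q), using d = −½ ln[(1 − 2P − Q)·√(1 − 2Q)]. It is an illustration only: the function name and the toy sequences are hypothetical, gapped or ambiguous positions are simply skipped, and it is not the pipeline used to produce Fig. 6.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance between two aligned DNA sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous bases
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1  # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    term1 = 1.0 - 2.0 * p - q
    term2 = 1.0 - 2.0 * q
    if term1 <= 0 or term2 <= 0:
        raise ValueError("distance undefined: sequences are too divergent")
    return -0.5 * math.log(term1 * math.sqrt(term2))


# Toy aligned spacer fragments (hypothetical): one transition at the first site.
print(round(k2p_distance("ACGTACGTAC", "GCGTACGTAC"), 4))
```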

Figure 6

The hypervariable regions among Artemisia species. The horizontal axis shows the intergenic spacer regions that are highly variable among the 18 Artemisia species. The vertical axis shows the arbitrary K2P distance of these regions. The square in the middle of each line represents the mean distance of each intergenic spacer region.

Discussion

Artemisia giraldii is a medicinal plant primarily used as a source of traditional medicines. Obtaining its genomic information is a critical step towards understanding the biosynthesis of its active components. As a first step, we sequenced and assembled the mitogenome and plastome of A. giraldii in the current study. We then analysed the general features of the two genomes and compared them in detail.

In the plastome, two copies of IRs separate SSC and LSC regions88. When an IR region is present, homologous recombination occurs between the two copies and results in the frequent ‘flip’ inversion of the SSC region between the two copies, thus allowing two heterogeneous genomic orientations to coexist in a single plant with approximately the same frequency89,90.

In this study, we used two strategies to assemble the plastome of A. giraldii. The two resulting assemblies were identical, except that the SSC region was inverted (Supplementary Fig. S1A). Reverse-complementing the SSC region of the assembly built from Illumina and Nanopore data produced a sequence identical to the assembly built from Illumina data (Supplementary Fig. S1B). Coverage depth is an indicator used in evaluating the correctness of mitochondrial and plastid genome assemblies, and a drop in coverage depth is often considered a sign of misassembly. We observed several regions with low depth (Supplementary Figs. S2A and S3A,B). To determine whether assembly problems had occurred, we visually examined these regions. The mapping results (Supplementary Figs. S2B and S3C) showed that the reads sufficiently covered the regions, suggesting that the regions were correctly assembled. Further examination showed that the regions were AT rich. AT-rich regions tend to be highly polymorphic and error prone in long-read sequencing, which results in a low coverage depth91.
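
The comparison of the two plastome assemblies can be sketched as follows: reverse-complement the SSC region of one assembly and test whether the result equals the other. This is a minimal illustration rather than the procedure actually used. It assumes that both assemblies are linearized at the same start position and that the SSC coordinates are known; the function names, toy sequences and coordinates are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def flip_ssc(plastome, ssc_start, ssc_end):
    """Return the plastome with the SSC region [ssc_start, ssc_end) inverted."""
    return (
        plastome[:ssc_start]
        + reverse_complement(plastome[ssc_start:ssc_end])
        + plastome[ssc_end:]
    )


def differ_only_by_ssc_orientation(assembly_a, assembly_b, ssc_start, ssc_end):
    """True if the assemblies differ and become identical after flipping the SSC."""
    return (
        assembly_a != assembly_b
        and flip_ssc(assembly_a, ssc_start, ssc_end) == assembly_b
    )


# Toy plastome laid out as LSC + IRb + SSC + IRa (hypothetical sequences).
lsc, irb, ssc = "ATGCCG", "GGATC", "AACGT"
ira = reverse_complement(irb)
isomer_a = lsc + irb + ssc + ira
isomer_b = lsc + irb + reverse_complement(ssc) + ira
start = len(lsc) + len(irb)
print(differ_only_by_ssc_orientation(isomer_a, isomer_b, start, start + len(ssc)))
```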

The mitogenome of plants is much larger than the plastome92 because of frequent exchange with nuclear and chloroplast DNA93, repeat sequences, AT-rich non-coding regions, large introns and non-coding sequences94. The mitogenome size commonly ranges from 200 to 2400 kb in angiosperms95, whereas the plastome size commonly ranges from 100 to 200 kb. To determine whether the small size difference between the two organelle genomes of A. giraldii is unusual, we compared the mitogenome and plastome sizes of the 318 species for which both genomes had been released in GenBank by August 1st, 2022 (Supplementary Table S13). The comparison showed that the size difference in A. giraldii was small relative to that in most of these species.

The size difference between the mitochondrial and plastid genomes of A. giraldii was only 43,226 bp, which is extremely small compared with that of other species. Among the 318 species, 95 showed a smaller difference between mitogenome and plastome sizes than A. giraldii. Of these 95 species, 94 were algae and mosses. The only angiosperm with a smaller size difference was Bidens pilosa from Asteraceae, with 1236 bp, which was the smallest difference among all pairs of mitogenomes and plastomes in this study. These observations suggest that mitogenome expansion developed along with plant evolution.

Among the Asteraceae species, A. giraldii had the second smallest difference. Seven other Asteraceae species, Bidens parviflora, Bidens biternata, Bidens bipinnata, Chrysanthemum indicum, Chrysanthemum boreale, Bidens tripartita and Ageratum conyzoides, also had small size differences between their two organelle genomes: 44,511, 46,989, 46,990, 57,125, 59,990, 66,297 and 67,873 bp, respectively. This result indicates that a small size difference is a common phenomenon in Asteraceae. The cause of this phenomenon has not yet been reported, and the specific mechanisms need to be further explored.
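
The ranking stated above can be reproduced from the size differences quoted in the text. The short sketch below simply sorts those values; the variable and function names are illustrative, and only the nine Asteraceae values given in this section are included.

```python
# Mitogenome-minus-plastome size differences (bp) quoted in the text.
SIZE_DIFFERENCE_BP = {
    "Bidens pilosa": 1236,
    "Artemisia giraldii": 43226,
    "Bidens parviflora": 44511,
    "Bidens biternata": 46989,
    "Bidens bipinnata": 46990,
    "Chrysanthemum indicum": 57125,
    "Chrysanthemum boreale": 59990,
    "Bidens tripartita": 66297,
    "Ageratum conyzoides": 67873,
}


def rank_by_difference(differences):
    """Species names ordered from the smallest to the largest size difference."""
    return sorted(differences, key=differences.get)


ranking = rank_by_difference(SIZE_DIFFERENCE_BP)
for position, species in enumerate(ranking, start=1):
    print(position, species, SIZE_DIFFERENCE_BP[species])
```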

We drew a figure to show the sizes of the seven most representative mitogenomes. The largest known mitogenome was obtained from Cucumis melo. The smallest known angiosperm mitogenome was obtained from Bidens pilosa. The sizes of four Asteraceae mitogenomes were in between (Fig. 7). The mitogenomes of different plants differ greatly in size.

Figure 7

Comparison of mitogenome size between different species. Mitogenome sizes vary greatly among different plants. The outermost circle represents the size of the Cucumis melo mitogenome. The sizes of the circles are not drawn to scale.

We analysed the homologous sequences between the mitogenome and plastome. Sequence migration is common in plants96. Plastid or nuclear DNA fragments can be inserted into mitochondrial DNA, resulting in an expanded mitogenome. These cp-derived mtDNAs can contain complete or partial PCG sequences87,97 and some tRNA sequences86. Frequently, the transferred sequences have no function. We found nine homologous fragments between the plastid DNA and mitochondrial DNA. Their total length was 4806 bp, accounting for 2.47% of the whole mitogenome. To determine whether these homologous sequences originated from a common ancestor (vertical transfer) or were transferred from the plastid to the mitochondrion (horizontal transfer), we used BLASTN to search for them in the plastome and mitogenome of C. boreale. Eight fragments (F1, F2, F3, F5, F6, F7, F8 and F9; Supplementary Table S9) had homologous sequences in both the plastome and mitogenome of C. boreale, whereas fragment F4 had a homologous sequence only in the plastome of C. boreale. Therefore, we speculate that the eight fragments (F1, F2, F3, F5, F6, F7, F8 and F9) may have originated from the common ancestor and been preserved throughout evolution, whereas the remaining fragment (F4) may have been transferred from the plastome to the mitogenome in A. giraldii. Thus, we suspect that a low degree of DNA exchange between the mitochondrial and plastid DNAs is responsible for the low level of mitogenome expansion in A. giraldii.
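
The reasoning used to separate putatively ancestral fragments from a putative recent transfer can be written as a small decision rule. The sketch below encodes the BLASTN presence/absence pattern reported above for C. boreale and applies that rule; the data structure, function name and category labels are illustrative, and no BLASTN search is run here.

```python
# Presence of a homologous sequence in C. boreale for each A. giraldii
# fragment, as reported in the text: F4 was found only in the plastome.
HITS_IN_C_BOREALE = {
    f"F{i}": {"plastome": True, "mitogenome": i != 4} for i in range(1, 10)
}


def classify_fragment(hit):
    """Interpret the presence/absence pattern of one fragment."""
    if hit["plastome"] and hit["mitogenome"]:
        return "putatively ancestral"
    if hit["plastome"]:
        return "putative plastome-to-mitogenome transfer"
    return "unresolved"


classification = {
    name: classify_fragment(hit) for name, hit in HITS_IN_C_BOREALE.items()
}
for name, label in classification.items():
    print(name, label)
```

A single related species gives limited resolution, so the labels should be read as hypotheses in the same speculative sense as the text.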

Compared with plastomes and nuclear genes, the mitogenome has rarely been used to reconstruct phylogenies, partly because of its slower nucleotide substitution rate and the difficulty of complete assembly and direct alignment98,99. We used the sequences of common genes to construct mitochondrial and plastid trees with the ML and BI methods. A. giraldii was placed at the same location in both trees. However, the plastid and mitochondrial trees differed in topology, particularly in the branches containing L. sativa and the four Helianthus species. In the plastid tree, L. sativa occupied the outermost position in the clade formed by the Asteraceae family. In the mitochondrial tree, L. sativa was located within the clade formed by the other Asteraceae species.

According to the NCBI taxonomy (https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi), L. sativa belongs to the Cichorioideae, whereas the other nine Asteraceae species belong to the Asteroideae. Hence, the plastid tree was more in line with the taxonomic classification than the mitochondrial tree. L. sativa and Asteroideae species are located in different branches of the phylogenetic tree100,101. To further understand the relationships among the mitochondrial genomes of the 10 Asteraceae species, we aligned the mitogenome of A. giraldii (NC_064134.1) against those of the other species by using the BLASTN suite in NCBI (https://blast.ncbi.nlm.nih.gov/Blast.cgi). The results showed that the sequence similarity among the 10 Asteraceae species was consistent with the relationships shown in the mitochondrial tree (Supplementary Table S14). The sequence similarity between A. giraldii and L. sativa was higher than that between A. giraldii and the four Helianthus species or A. conyzoides.

Previous reports and the sequence alignment results confirm the incongruence between the plastome tree and the mitogenome tree for L. sativa. We hypothesise that the difference in topology between the two trees results from the inconsistent evolutionary rates of the plastome and mitogenome. Further analysis of the mitogenome of L. sativa is required to elucidate the incongruence. In addition, the support values for the grouping of H. grosseserratus and H. strumosus in the plastome tree and of H. grosseserratus and H. annuus in the mitogenome tree were less than 50, because the high sequence similarity among Helianthus species makes these branches difficult to separate. A. giraldii had the same branch structure in the two trees with high support values, suggesting high credibility for its evolutionary relationship. The closest relatives of A. giraldii were C. indicum and C. boreale. This result is consistent with their taxonomic relationship, as all three species belong to Artemisiinae. The collinearity results support this conclusion: C. indicum and C. boreale shared larger syntenic fragments with the A. giraldii mitogenome than the other species did. Overall, the results revealed that the gene orders of the mitogenomes of the 10 Asteraceae species differed significantly.

Most mitochondrial genes are highly conserved and have undergone neutral and negative selection. Selective pressure analysis is commonly used to identify genes that have been under positive or negative selection during adaptation to a particular lifestyle. In this analysis, the adjusted p-values of ccmFc, nad1, nad6, atp9, atp1 and rps12 were below 0.05, suggesting that these genes underwent positive selection during evolution. The other 22 genes were more conserved and not subject to positive selection. The adjusted p-values of ccmFc and nad1 were 0, suggesting that they are subject to strong positive selection. ccmFc encodes a protein similar to the C-terminal part of the bacterial ccmF. It is involved in cytochrome c maturation and is present in a large complex in wheat mitochondria102. nad1 encodes a subunit of NADH dehydrogenase and plays an important role in mitochondrial electron transport103.

Given the limited availability of mitogenomes in Artemisia, we used the plastome sequences of 18 Artemisia species to predict one pair of primers that could potentially amplify a variable DNA region to distinguish among the 18 species. However, the predicted amplified fragment was too long to validate. We concluded that this molecular marker may not be applicable for distinguishing them. Instead, we analysed the hypervariable regions of the 18 species to obtain usable molecular markers. Owing to the large number of species, the variant sites in a single hypervariable region cannot distinguish all 18 species from one another. The variant sites in ccsA-ndhD are also present in rpl32-trnL-UAG, and thus the 17 variant sites in the two hypervariable regions (ndhG-ndhI and rpl32-trnL-UAG), comprising 11 SNPs and six indels, were combined. With these sites, we were able to completely distinguish among the 18 Artemisia species (Supplementary Fig. S7). Further experimental verification of these molecular markers is needed.
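
The point that one region alone may be insufficient while two combined regions resolve every species can be checked mechanically: two species are indistinguishable if they carry the same alleles at every scored variant site. The sketch below illustrates that check on hypothetical toy profiles with three placeholder species; it does not use the actual alleles from Supplementary Fig. S7, and the function names are illustrative.

```python
from collections import defaultdict


def indistinguishable_groups(profiles):
    """Groups of species sharing identical alleles at all scored sites."""
    groups = defaultdict(list)
    for name, alleles in profiles.items():
        groups[tuple(alleles)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


def combine(profiles_a, profiles_b):
    """Concatenate the allele profiles of two regions, species by species."""
    return {
        name: list(profiles_a[name]) + list(profiles_b[name])
        for name in profiles_a
    }


# Hypothetical alleles at the variant sites of two regions ("-" marks a deletion).
region_1 = {"species_1": ["A", "T"], "species_2": ["A", "T"], "species_3": ["G", "T"]}
region_2 = {"species_1": ["-"], "species_2": ["C"], "species_3": ["C"]}

print(indistinguishable_groups(region_1))                     # collisions in region 1 alone
print(indistinguishable_groups(region_2))                     # collisions in region 2 alone
print(indistinguishable_groups(combine(region_1, region_2)))  # collisions after combining
```

An empty list after combining means that every species has a unique profile across the pooled sites.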

Conclusions

In this study, we assembled the mitogenome and plastome of A. giraldii for the first time. Phylogenetic analysis showed that the position of A. giraldii was identical in the phylogenetic trees constructed from the mitochondrial and plastid protein-coding sequences, suggesting possible co-evolution of the genomes of the two organelles. Homologous sequence analysis identified nine homologous fragments between the two organelles, one of which might have been transferred from the plastome into the mitogenome. This study may provide a reference for studying the evolutionary relationship between mitochondria and plastids in Asteraceae species.