Ulmaceae includes approximately 16 genera and 230 species that are primarily distributed in the tropical-to-cold temperate zone of the Northern Hemisphere. Currently, eight genera of Ulmaceae are found in China, including Ulmus, Celtis, Aphananthe, Trema, Gironniera, Zelkova, Hemiptelea, and Pteroceltis. These genera include 46 species and ten varieties1 distributed throughout the country, and Ulmus accounts for nearly half of these species. Elms generally exhibit extensive adaptability and strong resistance2,3, mainly in afforestation and landscape greening applications4,5. In addition, most types of elm woods are hard, delicate, wear-resistant, tough, and excellent in quality, and can be used for furniture, construction, and bridges6. Numerous beneficial substances can be found in the bark and root bark of elms, many of which have high medicinal value7,8. The phloem of elm has high viscosity and can be used as a natural plant adhesive, and the leaves can be used as animal feed9. In addition, the seed oils of Gironniera, Ulmus, Aphananthe, and Celtis can be used for industrial purposes10.

Plant palynological fossils and other studies have documented that elms have existed since approximately the third century of the geological age11,12. As an ancient Tertiary tree family, Ulmaceae is rich in germplasm resources. The large numbers of naturally occurring polyploids and mutants13,14 and interspecific and intraspecific hybrids15 lend themselves to extensive elm varieties worldwide, with complex genetic backgrounds16,17,18. However, because previous plant classification and identification methods focused on morphological characteristics, pollen characteristics, and flavonoid differential substances19 but generally lacked molecular identification. Many differences and controversies exist in the evolution and classification of Ulmaceae plants20,21,22,23,24, including the attribution of Ulmus, Pteroceltis, Gironniera, Trema, and Aphananthe25,26, and the classification and species determination of Ulmus vary widely27,28.

In this study, we sequenced, assembled and annotated the cp genomes of U. parvifolia, H. davidii, U. lamellosa, U. castaneifolia and U. pumila ‘zhonghuajinye’, and compared their sequences with related species. Moreover, this present study using the cp genome to construct the evolutionary tree aimed to improve our understanding of evolution within Ulmaceae species. The plant-specific cp genome is relatively independent of the nuclear genome. Compared to nuclear genome sequences, the cp genome exhibits a low molecular weight, low nucleotide substitution rate and slow structural variation; therefore, it is increasingly used to solve deep phylogenetic problems within plants29,30,31. Besides, the structural characteristics and variation of the cp genomes of Ulmus and Ulmaceae species were preliminarily documented to obtain comprehensive understanding the structure of plastomes within Ulmaceae, which will help to lay the foundation for the accurate identification of Ulmus and Ulmaceae species classification and genome evolution.

Materials and methods

Test materials

Hemiptelea davidii, Ulmus parvifolia, Ulmus lamellosa, Ulmus castaneifolia and Ulmus pumila ‘zhonghuajinye’ (Fig. 1) were used as the focal experimental species. In May 2019, young and healthy mature leaves on annual branches of each sample were selected from the Germplasm Resources Nursery of the Hebei Forestry and Grassland Science Research Institute. All methods were carried out in accordance with relevant guidelines and regulations.

Figure 1
figure 1

Leaves characteristic of four Ulmus species and H. davidii.

DNA extraction and Illumina sequencing

The leaves were cleaned with ultrapure water and then immediately placed into liquid nitrogen and stored at − 80 °C. A plant DNA extraction kit (TIANGEN Biotech, Beijing, China) was used to extract the total DNA from fresh young leaves of each sample. The integrity and quality of total DNA were detected using agarose gel and a NanoDrop2000 microspectrophotometer (Thermo Fisher Scientific, Waltham, MA, USA). The qualified samples were sent to Beijing Zhongxing Bomai Technology Co., Ltd. (Beijing, China) for cp genome sequencing using the Illumina HiSeq PE150 double-end sequencing strategy.

Chloroplast genome assembly, annotation and visualization

Clean reads were filtered using Trimmomatic ver. 0.33 software32 to acquire clean reads by deleting adaptors and low quality reads. GetOrganelle33 was used to assemble cp genome sequences, which were then annotated using GeSeq software34. HMMER and ARAGORN v1.2.3835 were used to ensure the accuracy of the predictions for the encoded protein and RNA genes, respectively. Moreover, the Chloroplotc36 was used to draw the cp genome maps. Finally, the newly obtained cp genomes were uploaded to the NCBI database.

Sequence and genome comparison analyses

The single sequence repeats (SSRs) were determined using MISA37 among the cp genomes of 23 Ulmaceae species. The parameter settings for single mononucleotide, dinucleotide, trinucleotide, tetranucleotide, pentanucleotide and hexanucleotide repeats were ten, six, five, five, five and five, respectively. REPuter38 was to identify and locate the repeat sequences among Ulmaceae species including forward repeats (F), reverse repeats (R), palindromic repeats (P) and complement repeats (C) and the following parameters were used: (1) 30 bp minimum repeat size and (2) 90% or greater sequence identity (Hamming distance = 3). Tandem Repeats Finder ver. 4.0439 was used to analyze and detect tandem repeats, with the default parameters. The mVISTA software40 (Frazer et al., 2004) was used to examine the genetic divergence among Ulmaceae species using U. pumila as reference, in the LAGAN model. We also conducted a window analysis to identify the nucleotide diversity (Pi) among the cp genomes of 21 Ulmaceae species using DnaSP v5.10 software41.

Ka/Ks and positive selection on plastid genes

A total of 77 protein coding genes from 23 cp genomes of Ulmaceae species were selected for positive selected genes (PSGs) identification and analysis. First, MAFFT v742 was used to compare the amino acid sequences of each gene. PhyML v3.0 software43 was then used to construct the phylogenetic tree based on the maximum likelihood (ML) method for the above multiple-sequence alignment results. Subsequently, trimAl v1.444 was used for trimming, and PAML v4.9 CodeML was used for branch-site analysis. The parameters of Model A and Model A null in branch site were Model A (Model = 2, NSsites = 2, fix/omega = 0, omega = 2) and Model A null (Model = 2, NSsites = 2, fix/omega = 1, omega = 1). The likelihood ratio test (LRT) of paml chi2 (chi2 d.f.2ΔlnL) was used to obtain the LRT P value. False discovery rate correction was performed on the LRT P value. Gene with P value < 0.05 was selected as PSG. Lastly, the posterior probabilities of amino acid sites were calculated using Bayes Empirical Bayes (BEB) to determine whether the sites were positively selected.

Phylogenetic analyses

23 Ulmaceae species were selected from the NCBI database (Table S1). The phylogenetic trees were constructed with Arabidopsis thaliana as an outgroup. The cluster analyses were conducted based on the whole cp genome sequence, common protein genes (accD, atpA, atpB, atpE, atpF, atpH, atpI, ccsA, cemA, clpP, matK, ndhA, ndhB, ndhD, ndhE, ndhF, ndhG, ndhH, ndhI, ndhJ, ndhK, petA, petB, petD, petG, petL, petN, psaA, psaB, psaC, psaI, psaJ, psbA, psbB, psbC, psbD, psbE, psbF, psbH, psbI, psbJ, psbK, psbL, psbM, psbN, psbT, psbZ, rbcL, rpl14, rpl16, rpl20, rpl22, rpl23, rpl2, rpl32, rpl33, rpl36, rpoA, rpoB, rpoC1, rpoC2, rps11, rps12, rps14, rps15, rps16, rps18, rps19, rps2, rps3, rps4, rps7, rps8, ycf1, ycf2, ycf3 and ycf4) and common genes in IR region (ndhB, rpl2, rpl23, rps12, rps7, ycf1 and ycf2). MAFFT v7 was used to align the cpDNAs sequences under default parameters42, and the alignment was trimmed by Gblocks/0.91b to remove low-quality regions with the parameters: − t = d − b4 = 5 − b5 = h45. The Maximum-likelihood (ML) method was performed for the phylogenetic analyses using PhyML v3.043. Nucleotide substitution model selection was estimated with jModelTest 2.1.1046. The model GTR + I + G was selected for ML analyses with 1,000 bootstrap replicates to calculate the bootstrap values (BS) of the topology. Moreover, the results were treated with iTOL 3.4.347.

Results and analysis

Chloroplast characteristics of Ulmus species

In the present study, the cp genomes of H. davidii, U. parvifolia, U. lamellosa, U. castaneifolia and U. pumila ‘zhonghuajinye’ were sequenced, assembled and annotated. As shown in Fig. 2 and Table 1, the cp genomes of the five species were covalently closed double-chain cyclic molecules with a typical four-segment structure, and the sizes ranged from 159,113 to 160,388 bp (Table 1). U. lamellose had the largest genome, while U. pumila ‘zhonghuajinye’ had the smallest. The lengths of the LSC in each segment varied greatly (87,736–88,466 bp), with a difference of 730 bp. The longest LSC occurred in U. lamellosa, followed by U. castaneifolia, U. pumila ‘zhonghuajinye’, H. davidii, and U. parvifolia. The lengths of the SSC region ranged from 18,485 to 19,024 bp, with a difference of 539 bp. And the variation range of the SSC region was smaller than that of the LSC region. Among them, U. lamellosa had the largest SSC region and U. pumila ‘zhonghuajinye’ had the smallest. Besides, the smallest IR region occurred in U. pumila ‘zhonghuajinye’ (26,317 bp), while the largest was found in H. davidii (26,622 bp). The cp genome of H. davidii with a total of 130 genes contained the smallest number of genes of the five species, while the other four species had 131 genes each. The five species contained 85–86 protein-coding genes, 37 tRNAs and eight rRNAs. In addition, the coding region was longer than the non-coding region and the coding region (36.62–36.74%) had significantly higher GC content than the non-coding region (33.96–34.48%). Moreover, the GC content in rRNA was higher than that in tRNA.

Figure 2
figure 2

The cp genome maps of Ulmus species. The species name and specific information regarding the genome (length, GC content, and the number of genes) are depicted in the center of the plot. Extending outward, the four layers are the nucleotide diversity of U. pumila ‘zhonghuajinye’, U. parvifolia, U. castaneifolia and U. lamellosa respectively.

Table 1 The basic characteristics of the cp genomes of four Ulmus species and H. davidii.

In addition, the total GC contents of the five species were similar, ranging from 35.30 to 35.62% which was higher than in the LSC and SSC regions, but lower than in the IR region. Moreover, the first position had the highest GC content than the second and third positions (Fig. 3). Comparative analysis indicated that gene structure was relatively conservative and most genes did not contain introns. In this study, the number of genes containing introns were 23. Among these, the clpP and ycf3 genes contained two introns. The other genes contained only one intron that primarily involved 13 coding genes (rps16, atpF, rpoC1, rpl2 × 2, ndhB × 2, rps12 × 2, ndhA, petB, petD and rpl16) and eight tRNA genes (trnK, trnG, trnL, trnV, trnI × 2 and trnA × 2). The length of ndhA intron was the longest, followed by rpl16 and trnK (Fig. 4).

Figure 3
figure 3

The GC (%) content in different positions of CDs region of species within four Ulmus species and H. davidii.

Figure 4
figure 4

Intron length in the chloroplast genomes of four Ulmus species and H. davidii.

Gene loss and the Ka/Ks ratios of ulmaceae species pairwise

The protein-coding genes of the 23 Ulmaceae species including 15 Ulmus species were counted. The results were shown in Fig. 5. As it was shown, the gene of ndhC was lost in U. laciniata. In addition, the infA was lost in three species (H. davidii, G. subaequalis and A. aspera) with different degree.

Figure 5
figure 5

Loss of chloroplast protein-coding genes in the phylogeny of 23 Ulmaceae species.

The Ka/Ks ratios, which provided information on the effects of selection pressures on protein coding genes of each 23 Ulmaceae species pair, were calculated (Fig. 6). The results showed that the higher Ka/Ks ratios were detected in Ulmus species pairs than non-Ulmus species pairs.

Figure 6
figure 6

Pairwise Ka/Ks ratios in Ulmus and other Ulmaceae species.

Positive selection analysis of protein sequence among Ulmaceae species

Seventy-seven common CDS genes from 23 Ulmaceae species were subjected to positive selection analysis (Table 2 and Supplementary Table S2). And Model A and Model A null were calculated using codeML. The results showed that no genes were positively selected. However, the BEB analysis indicated that two protein-coding genes (rps15 and rbcL) had significant posterior probabilities and there was a positive selection site in each gene. Besides the rps15 and rbcL genes were located in the SC region.

Table 2 The potential positive selection test based on the branch-site model.

Repeat sequence analysis of Ulmaceae species

A total of 64–133 SSRs were identified in the cp genome of the 21 Ulmaceae species, with lengths of 10–29 bp, including mononucleotides, dinucleotides and trinucleotides. The mononucleotide repeats ranged from 63 to 126, followed by dinucleotide (2–9) and trinucleotide (1–3) repeats (Fig. 7A). The mononucleotides repeats were mostly composed of (A)n and (T)n, with only one (G)11-type SSR in G. subaequalis; one (G)10-type SSR in P. tatarinowii, T. orientalis, U. elongata and U. pumila ‘zhonghuajinye’; one (C)11-type SSR in A. aspera and U. parvifolia; and one (C)14-type SSR in U. lanceaefolia. Dinucleotide repeats included 11 SSRs of (AT)n and (TA)n of different lengths. Besides, trinucleotide repeats included (AAT)n, (ATA)5 and (TAT)5 SSRs of different lengths (Fig. 7B).

Figure 7
figure 7

(A) Analysis of SSRs in the cp genomes of 21 Ulmaceae species; (B) The numbers of SSRs types of 21 Ulmaceae species; (C) Frequency of SSRs in the LSC, IR and SSC region; (D) Frequency of SSRs in intergenic regions, protein-coding genes and introns regions.

The statistical results for the SSR distribution in the LSC, SSC and IR regions of the cp genome indicated that the SSRs in the 21 Ulmaceae species were mainly distributed in the LSC region with 44–107 SSRs, accounting for 69–83% of the total; followed by the SSC region with 11–27 SSRs, accounting for 15–22%; and the IR region with 0–8 SSRs, accounting for 0–12%. SSRs in H. davidii were only distributed in the LSC and SSC regions (Fig. 7C). In addition, SSRs were primarily distributed in intergenic regions ranging from 39 to 102 SSRs, while 9–31 occurred in introns and 9–22 occurred in CDS (Fig. 7D).

In the 21 Ulmaceae species, palindrome repeats (P), forward repeats (F), reverse repeats (R) and complement repeats (C) of repeat sequences were observed. C. biondii was the only species that lacked C repeats. The total number of repeat sequences ranged from 46 to 89 (21–35 of type P, 17–41 of type F, 1–17 of type R and 0–11 of type C), with G. subaequalis containing the fewest and U. gaussenii and U. chenmoui containing the most number (Fig. S1A). Moreover, the lengths of repeats primarily ranged from 30 to 39 bp, although three repeats were longer than 200 bp in U. americana, U. gaussenii and U. castaneifolia (Fig. S1B).

Chloroplast genomic divergence and hotspots regions

The mVISTA was used to compare and analyze the divergent regions of plastomes among the 23 Ulmaceae species with U. pumila as a reference. (Fig. 8). Overall, the 23 Ulmaceae species could be roughly divided into two groups: one containing 15 Ulmus species and two Zelkova species species; the other containing H. davidii, A. aspera, C. biondii, G. subaequalis, P. tatarinowii and T. orientalis. Significant separation was observed between the two groups. And the results showed that the cp genomes of Ulmus, Zelkova and Hemiptelea species were more conserved than the species of other group. In terms of region variation, the variation range of the LSC and SSC regions were greater than that of the IR regions. Moreover, the conservation of gene-coding regions was generally higher than that of non-coding regions. For example, the non-coding regions of trnH/psbA, trnK/rps16 and trnS/trnG exhibited large variation and could be used as an alternative region for DNA barcoding at later stages. Although the gene-coding region was overall highly conserved, the conservativeness of the ycf1 and ndhD genes was poor. These noncoding region and gene-coding region obtained could also be used as alternative regions for DNA barcoding of Ulmus and Ulmaceae species.

Figure 8
figure 8

Sequence identity plots of the 23 Ulmaceae species genomes generated using mVISTA, taking the annotation of U. pumila as a reference. Grey arrows above the alignment indicate the orientation of genes. Blue bars represent exons, pink ones represent non-coding sequences (CNS). X-scale axis represents the genome coordinate positions, Y-scale axis represents the percent identity within 50–100%.

To further clarify the diversity of Ulmus and Ulmaceae species at the sequence level, the nucleotide difference (pi) of the 15 Ulmus species and 23 Ulmaceae species were calculated respectively and suitable polymorphic loci from protein-coding sequences, IGS regions and intronic regions were identifed. The results showed that the most of the regions with the high nucleotide diversity among 15 Ulmus species were included from IGS regions, namely trnH/psbA, rps16/trnQ, trnS/trnG, trnG/trnR, rpoC1-intron, trnC/petN, ycf3-intron1, rps4/trnT, ndhC/trnV, psbE/petL, ndhF/rpl32, rpl32/trnL. The protein-coding regions of ndhD were also included in the suitable polymorphic loci (Fig. 9A, Table 3). What is more, these variation locis were mainly distributed in LSC and SSC region.

Figure 9
figure 9

Nucleotide variability (pi) values of 15 Umlus species and 23 Ulmaceae species. (A): pi values of 15 Umlus species. (B): pi values of 23 Ulmaceae species. X-axis: name of the regions; Y-axis: nucleotide diversity. window length: 300 bp; step size: 200 bp.

Table 3 High variable marker of cp genomes among 15 Ulmus species.

In addition, We also compared all the regions of cp genomes of the 23 Ulmaceae species in pairwise alignment. the cp genome variation primarily occurred in intergenic regions (Fig. 9B, Table 4), such as trnH/psbA, trnK/rps16, rps16/trnQ, trnS/trnG, trnG/trnR, trnT/psbD, psbZ/trnG, rps4/trnT, trnT/trnL, ndhC/trnV, accD/psaI, ycf4/cemA, psbE/petL, ndhF/rpl32, rpl32/trnL and ndhA-intron. In the coding regions, the most variable gene was ycf1 which showing that the gene-coding regions were more conservative than the non-coding regions. Thus, these region could be used as a potential molecular marker for the identification and phylogenetic analysis of Ulmus and Ulmaceae species.

Table 4 High variable marker of cp genomes among 23 Ulmaceae species.

Phylogenetic analysis of Ulmaceae species

To reveal the developmental relationship of Ulmaceae species, the phylogenetic tree based on the whole cp genome sequences, common protein-coding genes and common genes in IR region of 23 Ulmaceae species were constructed using the ML method. The results of three phylogenetic trees were nearly similar to a certain extent (Fig. 10). The 23 Ulmaceae species could be divided into two branches: one included Ulmus, Zelkova and Hemiptelea, among which Hemiptelea was the first to differentiate; and the other included Celtis, Trema, Pteroceltis, Gironniera and Aphananthe. Of the three trees, the one based on the whole cp genome and the common protein genes were more similar, and the U. lanceaefolia and U. elongata had the different locations. U. lanceaefolia was differentiated after Zelkova in Fig. 10A, while in Fig. 10B the U. lanceaefolia was differentiated after Zelkova and U. elongata. Besides the genetic relationship between C. biondii, T. orientalis, P. tatarinowii were different. The phylogenetic relationship of Ulmus species constucted based on IR region was different from the above two methods (Fig. 10C). For example, the U. chenmoui had a more closer relationship with U. glabra and U. americana.

Figure 10
figure 10

Phylogenetic trees of 23 Ulmaceae species. (A) Phylogenetic tree constructed using common protein coding genes; (B) Phylogenetic tree constructed using the whole cp genome; (C) Phylogenetic tree constructed using common genes in IR region.


Cp genome variation of Ulmaceae species

In the present study, the cp genome size, structure and composition of the four Ulmus species and H. davidii were highly conserved, displaying a typical quadripartite structure with a LSC, a SSC region and two IR regions, which was similar to the other angiosperms48. The cp genome of the five species ranged from 159,113 to 160,388 bp, encoding 130–131 genes, including 85–86 protein coding genes, 37 tRNAs and eight rRNAs. In particular, rps12 in Ulmaceae was recognized as the trans-spliced gene, which was in consistent with observations in other species49. The five species shared the similar GC content (about 35%). Besides, the overall difference in cp genome size was 1275 bp and the difference in LSC length was 730 bp, accounting for the majority of the cp genome variation. Therefore, the differences in cp genome length of the five species were primarily caused by variation in LSC length based on IR contraction or expansion50. In this study, the gene introns of the five species were compared and analyzed and the results indicated that most genes do not contain introns. There were only 23 genes harbored introns and no intron loss was found in the five species. Among them the clpP and ycf3 gene contained two introns, which is similar with the other plants51. Intron sequences were valuable in phylogenetic studies at lower taxonomic levels (e.g., closely related genera and interspecies)52. Huang et al.53 analyzed the phylogenetic relationship of the four species among Amana by combining partial DNA fragments of ITS nuclear sequence and trnL intron sequence, and proved that Amana wanzhensis was an effective species. Moreover, Huang et al.54 confirmed that the intron of the ndhA gene was a promising DNA barcode for Fagopyrum phylogenetic research. In this study, the ndhA (1233–1570 bp) gene had the longest introns. And the length of the ndhA gene intron varied the 337 bp among the five species. In the future, the intron of the ndhA gene may similarly used as a DNA barcode for the phylogenetic study of Ulmus, which will serve to facilitate the identification and utilization of natural Ulmus resources. The phenomenon of gene loss was common in most plant55. In the present study, the infA and ndhC gene were lost in different species, which was also occured in previous reported 56.

Identification of repeated sequences among Ulmaceae species

cpSSRs, which are uniparentally inherited materia and widely distribute in the genome of eukaryotes, with the characteristics of simple structures, small molecular weight and relative conservation, are short tandem repeats of 2 to 6 bp and widely used in species identification, genetic difference analysis at the individual level and population evolution studies57,58. In this study, a total of 64–133 SSRs were found in the cp genomes of 21 Ulmaceae species, including mononucleotide, dinucleotide and trinucleotide types. The numbers of mononucleotides were the largest among all the types and contributed to AT richness, which was similar to previous results59,60. The distribution of SSR loci in different regions was uneven and primarily occurred in the LSC region, SSC region and intergenic region, and less so in the IR region, gene region and introns. In addition, previous studies had reported that new genes had been generated from repetitive sequences, and SSR loci were more distributed in SCs, which may be one reason for their greater variation compared to the IR region61.

Adaptative evolution of the Ulmaceae plastome

In CodeML, there were four common models including branch model, site model, branch-site model and clade model. Among them, the branch-site model was usually used to assess potential positive selection of genes, in which the nonsynonymous and synonymous rate ratio (ω = dN/dS) was used to measured selection pressure and the ratio ω < 1, ω = 1, ω > 1 were considered to be purifying selection, neutral selection and positive selection, respectively62,63. Then the BEB method was further used to assess whether sites were under positive selection64. The analysis of adaptive evolution of genes is of certain value for studying the changes of gene structure, gene function, and evolutionary track of species65. The plastid genes with positive selection signature suggested that in response to the environment these genes might be undergoing adaptative evolution66. The cp genome was highly conserved and few genes with positive selection were identified, which is consistent with other studies67. For example, it was found that the rpoB, matk, ndhF, rps18, rps7, ycf4, clpP and rbcL genes were positively selected61. And the rpoB and matK gene has been used as DNA barcodes in phylogeny reconstruction of plants68,69. In this study, the positive selection analysis of 77 protein-coding genes among 23 Ulmaceae species indicated that there was no positively selected gene but the rps15 and rbcL had positive selection sites, which is consistent with the study of Xie et al.70, in which there were no significant p-values, while, some genes like petA, rps4, ndhE and rpoC1 were found with positive selection sites in the BEB test. The rps15 was different types of small subunit ribosomal structural proteins. In addition to playing an important regulatory role in ribosomal biosynthesis, the gene were also involved in regulating a variety of cellular life processes, such as genome integrity and development71,72,73. Besides, the rbcL gene (large subunit of ribose-1,5-diphosphate) was located in the large region outside the reverse repeat sequence, which encoded the large subunit of Rubisco. Eight rbcL and eight rbcS genes encoded by nuclear genes constitute Rubisco, which mainly catalyzed the fixation of carbon dioxide during photosynthesis and the oxidation of carbon during photorespiration. The sequence of the rbcL gene had been widely used in molecular systematics research to detect the systematic relationship and molecular evolution between plants74,75. Wu et al.76 used a single fragment of rbcL to obtain a phylogenetic tree of mangrove plant with a higher average node support rate than the matK and trnH-psbA fragments, which could accurately distinguish different tree species.

Identification of hotspots

DNA barcoding has been widely used in species identification, resource classification and phylogenetic evolution77. Cp genome thus plays an important role in the development of DNA barcoding. For example, the highly variable loci identified through sliding window and mVIST analysis in cp genome could be used as candidate markers for molecular markers, DNA barcoding and evolutionary analysis. Among them the molecular evolution rate of coding region and non-coding region is different, which is suitable for the phylogenetic study of different order. The coding region is suitable for the phylogenetic research of families, orders and even higher taxonomic levels, while the non-coding region is suitable for the phylogenetic research of genera and species78. For example, a phylogenetic tree based on the combined sequences of trnL-trnF and accD-psaI in the chloroplast noncoding region further confirmed the independent evolution of Eastern pear and Western pear from the maternal evolutionary background79. The matK gene, which exhibited rapid evolution and high polymorphism, was widely used as an important marker gene in evolutionary research and species identification80. Moreover, The regions such as matK, rbcL and trnK/rps16 have been proved to be commonly used as DNA barcodes in plant identification81. In this study, the result of alignment and nucleotide diversity revealed the sequencing five species had high level of similarity. It is similar to the other species that the LSC and SSC regions were more variable than the IR regions, whereas the coding regions were more conservative than the non-coding regions82. Some polymorphic regions by comparison of 15 Ulmus species were also identified using the sliding window and mvista analysis. The most divergent regions were trnH/psbA, rps16/trnQ, trnS/trnG, trnG/trnR, rpoC1-intron, trnC/petN, ycf3-intron1, rps4/trnT, ndhC/trnV, psbE/petL, ndhF/rpl32, rpl32/trnL and protein-coding gene ndhD. Among them, trnH-GUG/psbA, trnS/trnG and ndhF/rpl32 had already been screened as a suitable barcode for plants83,84,85. The trnH/psbA is widely used as a phylogenetic marker in the Asteraceae family86. These hotspot regions obtained in our study could be used as DNA border in plant identification and system evolution in Ulmus species.

Phylogenetic analysis

The base substitution rate in the maternally inherited cp genome was much lower than that in the nuclear genome. Therefore, the cp genome had become an important basis for phylogenetic analysis of higher plants. In the Flora Reipublicae Popularis Sinicae (FRPS), Ulmus species were divided into four sections: Blepharocarpa, Chaetoptelea, Microptelea, and Ulmus. Section Ulmus was further divided into three series: Glabrae, Lanceaefoliae, and Nitentes. Among the five species sequenced in this study, U. parvifolia belongs to Sect. Microptelea; U. castaneifolia belongs to Ser. Nitentes of Sect. Ulmus; U. lamellosa and U. pumila ‘zhonghuajinye’ belong to Ser. Glabrae of Sect. Ulmus, and H. davidii is the only species of Hemiptelea, which is consistent with the results of constructing evolutionary trees from the cp genomes of 23 species. However, several differences existed. First, U. lanceaefolia belongs to Series Lanceaefoliae of Section Ulmus in the FRPS, but our results indicated that it did not belong to Section Ulmus. This discrepancy may be due to the fact that U. lanceaefolia was an evergreen plant, unlike other Ulmus species. A large amount of intraspecific variation in photosynthetic genes and intergenic regions of chloroplast genomes had been reported for other evergreen species87, leading to differences in evolutionary relationships. The second discrepancy was that U. gaussenii belongs to Series Glabrae of Section Ulmus in the FRPS, but our results indicated that this species was clustered into a small branch with U. castaneifolia and U. chenmoui of Series Nitentes. This result was consistent with classifications of Ulmus species based on leaf morphology, wood anatomical structure, and pollen morphology88,89,90. Based on the results of this study, U. lanceaefolia could be listed as a new Ulmus section or as a new genus of Ulmaceae in parallel with Zelkova and Hemiptelea. Furthermore, U. gaussenii could be included in Series Nitentes. However, the cp genome may not contain enough genetic information to thoroughly analyze the evolutionary relationship of Ulmaceae species; therefore, it is necessary to use nuclear genome information for further classification research.