Comparison of Four Complete Chloroplast Genomes of Medicinal and Ornamental Meconopsis Species: Genome Organization and Species Discrimination

High-throughput sequencing of chloroplast genomes has been used to gain insight into the evolutionary relationships of plant species. In this study, we sequenced the complete chloroplast genomes of four species in the Meconopsis genus: M. racemosa, M. integrifolia (Maxim.) Franch, M. horridula and M. punicea. These plants grow in the wild and are recognized as having important medicinal and ornamental applications. The sequencing results showed that the size of the Meconopsis chloroplast genome ranges from 151864 to 153816 bp. A total of 127 genes comprising 90 protein-coding genes, 37 tRNA genes and 8 rRNA genes were observed in all four chloroplast genomes. Comparative analysis of the four chloroplast genomes revealed five hotspot regions (matK, rpoC2, petA, ndhF, and ycf1), which could potentially be used as unique molecular markers for species identification. In addition, the ycf1 gene may also be used as an effective molecular marker to distinguish Papaveraceae and determine the evolutionary relationships among plant species in the Papaveraceae family. Futhermore, these four genomes can provide valuable genetic information for other related studies.

www.nature.com/scientificreports www.nature.com/scientificreports/ Recent chloroplast genomic research has provided large quantities of data that are useful for selecting pertinent markers to resolve obscure phylogenetic relationships in seed plants 9 . At present, nearly 3000 complete chloroplast genomes are available in the NCBI database (https://www.ncbi.nlm.nih.gov/genomes/GenomesGroup. cgi?taxid=2759&opt=plastid) 10 . However, there is only one sequence from the chloroplast DNA of Meconopsis species in GenBank 11 .
In this study, we sequenced and assembled the chloroplast genomes of four Meconopsis species using a next-generation sequencing platform. We report the assembly, annotation and analysis of the chloroplast genomes of Meconopsis racemosa, Meconopsis integrifolia (Maxim.) Franch, Meconopsis horridula and Meconopsis punicea. We also constructed phylogenetic trees to perform comparisons among chloroplast genomes published for other plant species in related families. This study expands our understanding of the diversity of chloroplast genomes of Meconopsis species and their evolutionary relationships and provides fundamental data for the genetic engineering of Meconopsis chloroplasts.

Results and Discussion
Chloroplast genome sequencing, assembly and validation. Using  The complete sequences of the four chloroplast genomes were assembled by both de novo and reference-based assembly. Gaps were validated using PCR-based sequencing with one primer pair (Supplementary Table 1 Chloroplast genome structural features and gene content. It was previously reported that the chloroplast genomes of angiosperms are conserved in their genomic structure in terms of gene number and order, although IR expansion or contraction occur frequently 12,13 . The Meconopsis chloroplast genomes are in accordance with this observation, and their genome structures are similar to those of other Papaveraceae species 14 . All of the Meconopsis chloroplast genomes display the typical quadripartite structure of angiosperm cpDNA, which consists of a pair of IR regions (51306-51988 bp) separated by an LSC region (82809-83982 bp) and an SSC region (17729-17898 bp). These four chloroplast genomes are highly conserved in gene content, gene order, and intron number. The Meconopsis chloroplast genomes harbor 127 genes, 90 coding proteins, 37 coding tRNAs and 8 coding rRNAs. Some genes are duplicated in the IR region, among which ten are protein-coding genes (rpl2, rpl12, rps12, rps15, rps16, rps19, ndhB, ycf1, ycf15 and ycf2), four are ribosomal RNA genes (rrn4.5, rrn5, rrn16, rrn23) and six are transfer RNA genes (trnL-CAA, trnN-GUU, trnR-ACG, trnA-UGC, trnI-GAU and trnV-GAC) ( Table 1). Fifteen protein-coding genes (petB, petD, ndhA, ndhB, atpF, rps12, rps15, rps16, rps19, rpl2, rpl12, rpl16, rpoC1, clpP, and ycf3) contain one or more introns. The A content ranged from 30.4 to 30.5%, the C content ranged from 19.7 to 19.8%, the G content ranged from 18.8 to 19%, the T content ranged from 30.8 to 31%, and the GC content ranged from 38.5 to 38.8%, indicating nearly identical levels among the four Meconopsis chloroplast genomes (Table 2).
Amino acid abundance and codon usage. Codon usage plays an important role in shaping chloroplast genome evolution. Mutational bias has been reported to have an essential role in this process 15 . As shown in Supplementary Tables 2-5, the 90 protein-coding genes are encoded by 26338, 26365, 26342 and 26337 codons in the chloroplast genomes of M. racemosa, M. integrifolia (Maxim.) Franch, M. horridula and M. punicea, respectively. Leucine (11.1-9.5%) was the most abundant amino acid among the proteins encoded by the chloroplast genes. Cysteine (1.2-1.7%) was the least abundant amino acid in the proteins encoded by chloroplast genes in the M. racemosa, M. integrifolia (Maxim.) Franch, M. horridula and M. punicea chloroplast genomes. Leucine and isoleucine are the most commonly observed amino acids in the proteins of chloroplast genomes of angioperms 16 .
We calculated and summarized the codon usage of the chloroplast genomes in these four plants (Fig. 2). The codon UUA, for leucine, occurred at the highest proportion in all four species (27.1-30.3%). There were a total of 711 codons encoding tRNA genes in the M. racemosa, M. integrifolia (Maxim.) Franch and M. horridula chloroplast genomes, but only 704 codons in the tRNA-encoding genes in M. punicea (Supplementary Tables 2-5), indicating that codons ending in U and A were common; perhaps the variation in the tRNA-encoding genes is related to species evolution.
We also calculated the relative synonymous codon usage (RSCU) in the chloroplast genomes of the four species. Usage of the start codon methionine AUG and tryptophan UGG had no bias (RSCU = 1). All Plastid RNA editing prediction. RNA editing is a generic term comprising a variety of processes that alter the DNA-encoded sequence of a transcribed RNA by inserting, deleting or modifying nucleotides in a transcript 17 . Chloroplast RNA editing was first discovered in 1991. Nearly 30 years after the discovery of C-to-U editing in plant chloroplasts, the field has recently expanded tremendously in several research directions 18 . RNA editing provides a way to create transcript and protein diversity 19 Tables 6-9). In these four species, www.nature.com/scientificreports www.nature.com/scientificreports/ The light gray inner circle corresponds to the AT content, and the dark gray circle corresponds to the GC content. Genes belonging to different functional groups are shown in different colors. www.nature.com/scientificreports www.nature.com/scientificreports/ the amino acid conversion from S to L was the most frequent type of conversion. As previously reported, with increased amino acids, the conversion from S to L becomes more frequent 21 . This finding indicated that the evolutionary conservation of RNA editing is essential 22,23 . simple sequence repeats and repetitive sequence analysis. Tandem repeat sequences consisting of 1-6 nucleotide repeat units are known as simple sequence repeats (SSRs), or microsatellites 24 . SSRs are valuable molecular markers with a high degree of variation within species and have been used in many population genetics and polymorphism investigations. Using the MISA software tool, we analyzed the occurrences and types of SSRs in the four Meconopsis chloroplast genomes. These genomes all have SSRs, and the majority of which are monoand dinucleotide repeats, which were identified 88 and 29 times, respectively. The mononucleotide repeats were A/T repeats, and 82.8% of the dinucleotide repeats were AT/AT repeats (Table 3). Although the AT richness in the SSRs of the four chloroplast genomes of Meconopsis species was similar to that identified in previous studies, which suggested that SSRs found in the chloroplast genome are generally composed of polythymine (T) or polyadenine (A) repeats 25 Table 3). These findings indicate that SSRs can be used as molecular markers to identify these plant species.

Meconopsis punicea
More complex and longer repeat sequences may play an important roles in sequence divergence and genomes 26 . In these four Meconopsis chloroplast genomes, we found that the length of repeated sequences ranged mainly from 30 to 90 bp, similar to the lengths reported in other angiosperm plants 25,27,28  Divergent hotspots in the Meconopsis chloroplast genome. Molecular markers with nucleotide diversity over 1.5% have been reported as highly variable regions that can be used for phylogenetic analysis and species identification in seed plants 29,30 . Currently, there are few molecular biology-based studies of Meconopsis plants, and there is no uniform molecular marker for species identification [31][32][33][34][35] .
A SNP (single nucleotide polymorphism) marker is a single base change in a DNA sequence, typically with two possible nucleotide alternatives at a given position 36

Self-replication
Large subunit of ribosome (LSU) rpl14, rpl16 a , rpl2 a,b , rpl2 a,b , rpl20, rpl22, rpl23 b , rpl23 b , rpl32, rpl33, rpl36  www.nature.com/scientificreports www.nature.com/scientificreports/ divergence levels, the nucleotide variability values within 800 bp in all four chloroplast genomes were calculated with DnaSP 6.10.03 software. The values ranged from 0 to 0.07, revealing slight differences among the genomes. For example, the p-distance between M. racemosa and each of M. integrifolia (Maxim.) Franch, M. horridula and M. punicea is 0.016, 0.001 and 0.018, respectively. These divergence hotspot regions can provide information for marker development for phylogenetic analyses of Meconopsis species. Overall, the results reveal higher divergence in noncoding regions than in coding regions. Using whole chloroplast genomes, we found that some regions differ among the four species, such as rps16, trnC-GCA, trnD-GCU, trnT-GGU, rps15, accD-PsaI and petA (Fig. 4a). The coding regions with marked differences include the matK, rpoC2, petA, ndhF and ycf genes (Fig. 4b). These genes could be utilized as potential phylogenetic markers to reconstruct the phylogeny in this genus. Qu Yan et al. reported that the ndhF gene could not be used to distinguish M. racemosa from M. horridula 37 . However, our present study shows that the sequence of the ndhF gene in the chloroplast genome differs between these two species is distinct. Divergent hotspots of chloroplast genomes have been used to identify species in other plants of the Papaveraceae family. Jianguo Zhou et al. used ycf1, rpoB-trnC, trnD-trnT, petA-psbJ, psbE-petL and ccsA-ndhD sequences in the chloroplast genome to distinguish Papaver orientale and Papaver rhoeas 14 . Zhe Zhang et al. 38 analyzed the phylogeny of 15 species from the Papaveraceae family based on the nuclear gene ITS sequence, the chloroplast gene rbcL sequence, and the combined sequences of these genes.
Next, we used the online program mVISTA to analyze gene order and content in the chloroplast genome. We found that the gene order and contents of the Meconopsis plants are similar to those of other members of the Papaveraceae family (Fig. 5). Similar to other plant species, all Meconopsis species have conserved chloroplast genomes, their coding regions are more conserved than their noncoding regions, and their IR regions are more conserved than their LSC and SSC regions 16,39,40 . Altitude and plant distribution. Altitude influences ecological factors such as water and temperature, which affects plant genetic variation and population differentiation 41 44 , which is consistent with both the phylogenetic results of this study and the altitudes of their distributions. Although they are distributed in the same region, there is evident genetic isolation among them. We speculate that altitude may be an important ecological factor that affects the evolution of Meconopsis plants. In addition, we found that M. racemosa, M. horridula and M. racemosa (MH394401) 11 were grouped together. For several years, the delimitation of M. racemosa and M. horridula in the genus has been highly controversial 46 . Fedd, Kingdon-Ward and Prain et al. considered M. racemosa and M. horridula to be the same species 46 . However, in Tibetan Flora, M. racemosa is described as a variant of M. horridula. M. racemosa and M. racemosa (MH394401) 11 were distributed on different branches but are the same species. Incomplete lineage sorting, insufficient informative characters, hybridization or plastid capture could be responsible for the incongruent phylogenetic positions of this species 47,48 .
We used the five gene markers (matK, rpoC2, petA, ndhF and ycf1 genes), screened by divergent hotspots in the Meconopsis chloroplast genomes, to construct five phylogenetic trees of these four Meconopsis plants and five other plants from the Papaveraceae family (P. somniferum, P. rhoeas, P. orientale, Macleaya microcarpa and Coreanomecon hylomeconoides) using Decaisnea insignis, Euptelea pleiosperma and Nuphar advena as outgroups ( Fig. 7 and Supplementary Figs 1-4). The results showed that M. racemosa, M. racemosa (MH394401) 11   www.nature.com/scientificreports www.nature.com/scientificreports/ Among the five genes, the rpoC2 gene is not a suitable for potential DNA barcoding of Meconopsis plants, and the ycf1 gene has the highest node support value in the phylogenetic tree, which is consistent with previous reports that have used ycf1 to distinguish unknown Papaveraceae plants 14,49 . In Tibetan Flora, M. racemosa is described as a variant of M. horridula on account of the similar morphological characterization of these taxa and the consistent ITS sequence. However, Dou et al. 35 , using the ITS2 sequence, and Ni et al. 34 , using the psbA-trnH sequence, constructed an evolutionary trees and found that these taxa clustered in different branches.
The chloroplast genome usually contains uniparentally inherited DNA, which is well suited for studying the evolutionary history of plants, such as dating a common ancestor 50 . Yuan et al. used the chloroplast genome sequence of trnL-trnF and found that M. punicea is the mother of the hybrid species Meconopsis × cookei (Papaveraceae) and that M. quintuplinervia is the father 33 .

Conclusions
In this study, we used the Illumina HiSeq 2000 system to sequence the complete chloroplast genomes of four Meconopsis species: M. racemosa, M. integrifolia (Maxim.) Franch, M. horridula and M. punicea. We demonstrate that these four Meconopsis species are divided into two groups, with M. racemosa and M. horridula in one group and M. integrifolia (Maxim.) Franch and M. punicea in the other. By comparing the chloroplast genome sequences, we were able to retrieve all genetic resources, including SNPs, SSRs, repetitive sequence, codon usage, RNA editing prediction, 'hotspot' regions and phylogenomic analysis. These resources will provide chloroplast genome molecular markers for the identification of these Meconopsis species. We also used four hotspot genes (matK, petA, ndhF and ycf1) to construct phylogenetic trees and clearly distinguish these species.
With the development of plant science, plastid transformation is becoming an important tool. The limited availability of complete chloroplast genomic information is one of the major factors preventing the extension of this technology to valuable plants. The Meconopsis chloroplast genome data obtained in this study could be applied in biotechnology and provide useful information for designing transformation vectors in the future. Chloroplast genome assemblage and annotation. For these four species, the high-throughput sequencing data were qualitatively assessed and assembled using NOVOPlasty 2.6.3. Gaps in the cpDNA sequences were filled by PCR amplification and Sanger sequencing. The annotations of the chloroplast genomes  www.nature.com/scientificreports www.nature.com/scientificreports/ Codon usage. Codon usage was determined for all protein-coding genes. The relative synonymous codon usage (RSCU) values and codon usage were determined with MEGA7, which was used to reveal the characteristics of the variation in synonymous codon usage 54 . simple sequence repeats and repetitive sequence analysis. Chloroplast microsatellites were identified in a high-quality sequence of clusterbean by using the MISA Perl script 55 . The minimum numbers for the SSR motifs were 10, 5, 4, 3, 3 and 3 for mono-,di-,tri-,tetra-,penta-,and hexanucleotide repeats, respectively. REPuter was used to identify forward repeats, reserve sequences, complementary and palindromic sequences, with a minimum repeat size of 30 bp and 90% sequence identity 56 Table 10) were used for phylogenetic analysis. All of the coding sequences from the 42 species were aligned with the MAFFT method based on codons by Geneious 8.0.4. The best nucleotide substitution model (GTR + G + I) was tested, and a maximum likelihood (ML) tree (1000 bootstrap replicates) was constructed with RAxML software 60 . BI analyses were conducted using GPU MrBayes. The GTR + I + G substitution model was used for BI. In the BI analyses, two simultaneous runs of 10000000 generations were conducted for the matrix. Each set was sampled every 1000 generations with a burn-in of 25%. The matK, rpoC2, petA, ndhF and ycf1 gene sequences of M. racemosa, M. integrifolia(Maxim.) Franch, M. horridula, M. punicea and 9 other species were collected from NCBI. Maximum likelihood (ML) analyses were conducted using RAxML software with the GTR model 61 . www.nature.com/scientificreports www.nature.com/scientificreports/