Comparative analysis of complete chloroplast genome sequences of four major Amorphophallus species

Amorphophallus (Araceae) contains more than 170 species that are mainly distributed in Asia and Africa. Because the bulbs of Amorphophallus are rich in glucomannan, they have been widely used in food, medicine, the chemical industry and so on. To better understand the evolutionary relationships and mutation patterns in the chloroplast genome of Amorphophallus, the complete chloroplast genomes of four species were sequenced. The chloroplast genome sequences of A. albus, A. bulbifer, A. konjac and A. muelleri ranged from 162,853 bp to 167,424 bp. The A. albus chloroplast (cp) genome contains 113 genes, including 79 protein-coding genes, 30 tRNA genes and 4 rRNA genes. The A. bulbifer cp genome contains 111 genes, including 78 protein-coding genes, 29 tRNA genes and 4 rRNA genes. A. muelleri contains 111 and 113 genes, comprising 78 and 80 protein-coding genes, respectively, 29 tRNA genes and 4 rRNA genes. The IR (inverted repeat) region/LSC (long single copy) region and IR/SSC (short single copy) region borders of the four Amorphophallus cp genomes were compared. In addition to some genes being deleted, variations in the copy numbers and intron numbers existed in some genes in the four cp genomes. One hundred thirty-four to 164 SSRs (simple sequence repeats) were detected in the four cp genomes. In addition, the highest mononucleotide SSRs were composed of A and T repeat units, and the majority of dinucleotides were composed of AT and TA. SNPs (single nucleotide polymorphisms) and indels (insertion-deletions) were calculated from coding genes and noncoding genes, respectively. These divergences comprising SSRs, SNPs and indel markers will be useful in testing the maternal inheritance of the chloroplast genome, identifying species differentiation and even in breeding programs. Furthermore, the regression of ndhK was detected from four Amorphophallus cp genomes in our study. Complete cp genome sequences of four Amorphophallus species and other plants were used to perform phylogenetic analyses. The results showed that Amorphophallus was clustered in Araceae, and Amorphophallus was divided into two clades; A. albus and A. konjac were clustered in one clade, and A. bulbifer and A. muelleri were clustered in another clade. Phylogenetic analysis among the Amorphophallus genus was conducted based on matK and rbcL. The phylogenetic trees showed that the relationships among the Amorphophallus species were consistent with their geographical locations. The complete chloroplast genome sequence information for the four Amorphophallus species will be helpful for elucidating Amorphophallus phylogenetic relationships.

SCIENTIFIC REpoRts | (2019) 9:809 | DOI: 10.1038/s41598-018-37456-z (ITS1) and plastid sequences (rbcL and matK) revealed a new subgeneric delineation by large-scale phylogenetic analysis of Amorphophallus 5 . The genome size of Amorphophallus is quite large, approximately 20 times larger than the rice genome 6 . Furthermore, large variation exists in the genomic sequences of Amorphophallus species. Therefore, sequencing the whole genome of Amorphophallus species is very difficult. Complete sequencing of chloroplast (cp) genomes is much easier to achieve in Amorphophallus species. The plant chloroplast is a key plastid involved in photosynthesis and carbon fixation 7 . Chloroplast genomes are more conserved than nuclear genomes and contain four important regions: a large single-copy (LSC) region, a small single-copy (SSC) region and a pair of inverted repeats (IRA, IRB) 8 . The cp genome contains important information and genetic markers for phylogenetic and taxonomic analyses between plant species and individuals [9][10][11] because of the low rates of polymorphisms, indels and SNPs in cps. More than 800 cp genomes have been sequenced and deposited in the NCBI. The first cp genome was discovered in Zea mays 12 , and a complete sequence was determined in Nicotiana tabacum and Marchantia polymorpha 13,14 . A circular cp genome of Aquilaria sinensis was found to be 159,565 bp long and contained 82 protein-coding genes. Zhang et al. reported sequences for five Epimedium species cp genomes, which provided valuable genetic information for accurately identifying species and assisted in the utilization of Epimedium plants 15 . These complete cp genome sequences have been widely used in the development of molecular markers for phylogenetic research 16,17 . Because of the ability for intracellular gene transfer and the conservation, diversity, and genetic basis of chloroplasts, transgene development has allowed for the engineering of high-value agricultural or biomedical products 18 . With the advent of high-throughput sequencing technology, it has become both standard practice and inexpensive to obtain cp genome sequences.
In this study, for the first time, we sequenced the complete cp genomes of four major Amorphophallus species using high-throughput sequencing technology and the Illumina HiSeq2500 platform. This study had four aims: (1) determine the size range and structure of four Amorphophallus species cp genomes; (2) compare the variations of simple sequence repeats (SSRs) among four major Amorphophallus cp genomes; (3) examine the indels and SNPs among four major Amorphophallus cp genomes; (4) confirm the phylogenetic relationship among four Amorphophallus species, as well as other species, using the complete cp genomes. These results will provide valuable and basic sequence information for taxonomic study and the development of molecular markers for further species identification of Amorphophallus. After the completion of the whole cp genome sequence, it is possible to build a database of the species. Based on the differences in the gene sequences of the four cp genomes, a DNA barcode can easily be developed to allow for the building of an efficient platform for postgenomics species research, such as subsequent gene excavation and functional verification of DNA sequence information.

Results and Discussions
Organization of four chloroplast genomes. Approximately 2G of data for each cp genome was obtained with a 300 bp read length. Gap closing was based on the sequence of the complete cp genome from Colocasia esculenta (NC_016753) 19 . The chloroplast genome sequences of the four genomes ranged from 162,853 bp (A. bulbifer) to 167,424 bp (A. konjac) (Fig. 1, Table 1). The same typical quadripartite structure was displayed in the four cp genomes. Two IR regions (25,379-26,120 bp) were separated by an LSC region (90,467-92,660 bp) and an SSC region (21,628-22,839 bp) ( Table 1). The IRB region was 39 bp longer than the IRA region in the A. konjac cp genome. The IR/LSC and IR/SSC borders of the four Amorphophallus cp genomes were compared (Fig. S1). The variation of the IR/LSC and IR/SSC borders was considered to be the primary mechanism causing the length differences of angiosperm cp genomes 20 . The GC content ranged from 35.39% to 35.90% for the four cp genomes (Table 1). These four Amorphophallus cp genome data were deposited in GenBank.
Divergence hotspots in four chloroplast genomes. The A. albus cp genome contains 113 genes, including 79 protein-coding genes, 30 tRNA genes and 4 rRNA genes. The A. bulbifer cp genome contains 111 genes, including 78 protein-coding genes, 29 tRNA genes and 4 rRNA genes. Both the A. konjac and A. muelleri cp genomes contain 112 genes, comprising 79 protein-coding genes, 29 tRNA genes and 4 rRNA genes. All of the features are shown in Table 1 and annotated in Fig. 1. All these genes play different roles in the chloroplast, and the classification is shown in Table 2.
The estimated deletion of some genes was detected in some Amorphophallus cp genomes ( Table 3). The ycf1 gene was present in three cp genomes but not in the A. bulbifer cp genome. Another gene named trnL-CAA appeared in the A. albus and A. konjac cp genomes. The trnG-GCC gene was lost in the A. konjac cp genome. The accD gene was found only in the A. muelleri cp genome, and psbE was missing only in the A. konjac cp genome. The rpl2 and rpl23 genes were annotated in the IRA and IRB regions of the four cp genomes, but they were only found in the IRA region and were lost in the IRB region in the A. albus cp genome.
In addition to some genes being deleted, variations in the copy numbers and intron numbers of some genes were also found in the four cp genomes. Eight protein-coding genes, four rRNA genes, nine tRNA genes and two putative genes were present in two copies. Moreover, trnT-GGU was found to have two copies only in the A. bulbifer and A. muelleri cp genomes. In addition, three copies of the rps12 gene were found. Fifteen genes contained introns, including four tRNA genes, ten protein-coding genes and one putative gene. The psbF and ycf2 genes containing one intron were only found in the A. muelleri cp genome. There were no introns in the clpP gene in the A. bulbifer and A. muelleri cp genomes, but there were two introns in this gene in the A. albus and A. konjac cp genomes, while ycf3 and infA had two introns in each of the four cp genomes. All of the above divergences are shown in Table 2. The development of molecular markers for the identification of Amorphophallus species was much easier based on the divergence hotspot regions of the four Amorphophallus cp genomes.  (Table 4). Among these SSRs, mono-, di-, tri-, tetra-, and hexanucleotides were detected. The mononucleotide SSRs were most common, with 70.15% of the SSRs observed in A. bulbifer. In addition, most of the mononucleotide SSRs were composed of A and T repeat units, and the majority of the dinucleotides were composed of AT and TA. The cp SSRs are normally composed of short polyA or polyT repeats 22 .
Higher contents of A/T and AT/TA repeats in cp SSRs were also detected in the Metasequoia glyptostroboides cp genome 23 . Hexanucleotide repeat unit SSRs were in the A. muelleri cp genome only at a portion of 2.08%. In short, the cp SSRs represented rich variation and were absolutely useful for polymorphism analysis in the Amorphophallus species.
Using the A. albus cp genome as the reference sequence, we compared the SNP/indel loci of the four cp genomes. SNP markers were detected in 65 protein-coding genes in A. bulbifer, A. konjac and A. muelleri cp genomes. Eleven genes were in the SSC region, and 54 genes were in the LSC region, indicating that the protein-coding genes in the IR region were more conserved. These 65 genes were divided into four categories according to their different functions in plant chloroplasts, including photosynthetic apparatus, photosynthetic metabolism, gene expression, and other genes. Nine hundred sixty-nine and 943 SNP markers were detected between A. albus and A. bulbifer in protein-coding genes and noncoding regions, respectively. One hundred and four and 176 SNP markers were detected between A. albus and A. konjac in protein-coding genes and noncoding regions, respectively. Nine hundred and seventy-eight and 926 SNP markers were detected between A. albus and A. muelleri in protein-coding genes and noncoding regions, respectively. The SNPs in the A. konjac cp genome were significantly fewer than those in the A. bulbifer and A. muelleri cp genomes. One hundred and fifty-nine SNP sites were found in Oryza. sativa and Oryza. nivara chloroplast genomes 24 , 591 SNP markers were detected between the Solanum tuberosum and S. bulbocastanum plastomes 25 , and 464 were detected between the plastomes of P. ginseng and P. notoginseng 26 . , trnK-UUU, trnL-CAA a (Aa, Ak), trnL-UAA * , trnL-UAG, trnM-CAU a , trnN-GUU a , trnP-GGG, trnP-UGG, trnQ-UUG, trnR-ACG a , trnR-UCU, trnS-GCU, trnS-GGA, trnS-UGA, trnT-UGU, trnT-GGU (Aa, Ab a , Ak, Am a ), trnV-GAC ,a , trnV-UAC * , trnW-CCA, trnY-GUA Small subunit of ribosome rps2, rps3, rps4, rps7 a , rps8, rps11, rps12 b , rps14, rps15, rps16, rps18, rps19 Large subunit of ribosome rpl2 (Aa * , Ab *, a , Am *, a , Ak *, a ), rpl20, rpl23 (Aa * , Ab a , Am *, a , Ak *, a ), rpl33, rpl36, rpl14, rpl16, rpl22, rpl32  All of the SNPs were classified into two types, including synonymous (S) and nonsynonymous (N) ( Table 5, Fig. 2). For the 969 and 978 SNP markers in the gene coding regions of the A. bulbifer and A. muelleri cp genomes, respectively, 696 and 708 belonged to the nonsynonymous type, and 273 and 270 belonged to the synonymous type. Synonymous and nonsynonymous SNP makers from the gene coding genes shared very similar numbers in these two cp genomes. There were 32 synonymous SNPs and 72 nonsynonymous SNPs in the protein coding regions of the A. konjac cp genome. Forty-eight nonsynonymous and 47 synonymous SNP sites were detected in the Machilus cp genome, implying that a substitution constraint mechanism existed 27 . Genes ycf3, rpoC1 and clpP were detected with SNP markers in their introns. Six, 1 and 6 SNP markers were found in one intron from rpoC1; 6, 1 and 5 SNP markers were found in one intron from ycf3; 23, 7 and 25 SNP markers were found in two introns from clpP in the A. bulbifer, A. konjac and A. muelleri cp genomes, respectively. ClpP and ycf1 were the variation hotspots for SNPs and indels, and they were usually used for investigating sequence variation in seed plants 28,29 .
cpSSR and SNP markers will be useful in testing maternal inheritance of the cp genome, identifying species differentiation and even in breeding programs 30 . cpSSRs have been demonstrated to be useful in gene flow studies to estimate seed and pollen contribution 31 and in phylogeographic analyses 32 .
Twenty-two protein-coding genes from three Amorphophallus cp genomes contained indels (Table 6). Only two coding genes were detected to contain indels in the A. konjac cp genome; one indel was in rps15, and two indels existed in ycf1. The indel numbers of each coding gene from the A. bulbifer and A. muelleri cp genomes are shown in Fig. 3A,B. The gene ycf1 was a hotspot for indel variation, and almost half of the number of indels existed in this gene (Fig. 3). Such mutational loci in cp genomes showed highly variable regions in the genomes. ndhK regression in four Amorphophallus species cp genomes. The ndhK gene was a new gene represented in the four Amorphophallus species cp genomes. It was 744 bp in length in the A. albus and A. konjac cp genomes, and 741 bp in length in the A. bulbifer and A. muelleri cp genomes. The gene ndhK is present in a novel protein complex of the thylakoid membrane and shows homology to a mitochondrial gene that encodes a subunit of the NADH-ubiquinone oxidoreductase of the mitochondria 33 . ndhK was reported as a gene encoding a subunit of PSII, but later, this protein was classified as a subunit of NADH dehydrogenase, and the gene has been renamed ndhK 34,35 . In many plants, such as Glycine max, Epimedium acuminatum, Psilotum nudum, Machilus yunnanensis, Actinidia chinensis, Veronica persica, and Aquilaria sinensis (Lour.) Gilg., ndhK was lost from their cp geno mes 15,18,27,[36][37][38][39] . ndhK has been found in the Paramecium aurelia mitochondrial (mt) genome 40 . The presence of this gene in the mt genome raises interesting questions concerning its evolutionary origin. The gene ndhK may play a crucial role in photosynthesis in four Amorphophallus species, and its presence in the cp genomes can be used as a marker for distinguishing them from other family species.
Phylogenetic analysis. Phylogenetic analysis of Amorphophallus species has been reported using different aspects, such as several chloroplast genes 41 , two chloroplast genes, leafy intron sequences 42 , plastid DNA markers and fingerprinting 43 . These studies simply demonstrated Amorphophallus sample species relationships and did not include the four Amorphophallus species that are the major commercial cultivation species used in our study. In addition, whole chloroplast sequences were much more accurate than individual gene sequences for phylogenetic analysis. In the present study, complete cp genomes sequences of four Amorphophallus species and other plants (Table S1) were used to perform phylogenetic analyses (Fig. 4). The clade of the four species of Amorphophallus was grouped with other Araceae species as expected. A. albus and A. konjac were clustered into one clade, and A. bulbifer and A. muelleri were clustered into another clade. These results showed that A. albus and   A. konjac had a close relationship, and A. bulbifer and A. muelleri were closely related. The matK and rbcL genes were also used for phylogenetic analysis among the Amorphophallus genus (Figs S3 and S4). Both of the phylogenetic trees indicated that the Amorphophallus species were grouped into three major clades named Africa, southeast Asia, and Continental Asia. The Continental Asia clade covered the taxa distributed from India to China and Thailand, which were subdivided into two subclades, Continental Asia I and II. The four Amorphophallus species in our study were all derived from the Chinese mainland; A. albus and A. konjac were grouped as Continental Asia I, and A. bulbifer and A. muelleri were grouped as Continental Asia II. The first two species came from the central region of China, and the other two species were collected from the southern region of China near Burma. The matK and rbcL genes well supported clades in consensus trees and the resolution of ingroup relationships within Amorphophallus 44 . All the results suggested that the relationship in Amorphophallus was consistent with the biogeographical distribution. A. konjac and A. bulbifer were also classified in two different clades by Sedayu 42   the propagation coefficient in A. bulbifer and A. muelleri increased significantly because of aerial bulbs growing in the stems. The aerial bulbs diminish the need for sexual reproduction and lead to a significantly increased reproductive capacity. In many cases, the evolutionary process is closely linked with the reproduction system of the species. A. muelleri and A. bulbifer reproduce, thus far, through apomictic processes. The corm of A. bulbifer is light red, and that of A. muelleri is light yellow. These phenotypes also demonstrated the relationship among the four Amorphophallus species. The sequenced cp genomes of the four Amorphophallus species provide a large amount of genetic information for phylogenetic analysis and taxonomic study.

Conclusion
We sequenced the chloroplast genomes of four Amorphophallus plants: A. albus, A. bulbifer, A. konjac, and A. muelleri. We annotated the four cp genomes and analyzed the structural divergence among the four cp genomes; moreover, we identified the SSR loci and SNPs in protein-coding genes. These SSRs and SNPs could be selected for use in developing markers and in phylogenetic analysis. Comparing the cp genomes among some plants suggested that psbG regressed in the A. albus, A. konjac, A. bulbifer and A. muelleri cp genomes. We also detected that some genes and introns were lost, in addition to copy differences of some genes among the four cp genomes.   in a different clade. The clustering analysis results verified the results of the SNP data. All the data will be very helpful in further research on Amorphophallus plants and chloroplasts and in expanding our understanding of the evolutionary history of the Amorphophallus cp genomes. All of these divergences in the four cp genomes were significant for taxonomic and evolutionary studies, as well as for genetic engineering developments in the future.

Plant cp genome assembly and annotation.
Trimmomatic v 0.32 45 was used for raw data processing, and the resulting clean data were used for assembly and post analysis. Fastqc v0.10.0 46 was used to evaluate the quality of the data visually. Velvet v1.2.07 47 was used to assemble the clean data, and the complete chloroplast genome sequence was obtained after gap closing. DOGMA 48 was used to annotate the cp genomes and predict the rRNA/tRNA of A. albus, A bulbifer, A. konjac, and A. muelleri. COGs (clusters of orthologous groups of proteins) were analyzed through rpsblast v2.2.30+ 49 . The circular cp genome maps were drawn using the OrganellarGenomeDRAW program 50 .