Introduction

Astragali Radix (AR), also known as Huangqi, is one of the most popular herbal medicines worldwide. As indicated in the Chinese pharmacopeia, AR is composed of dried roots of two Astragalus membranaceus varieties, namely, A. membranaceus (Fisch.) Bunge var. membranaceus and A. membranaceus (Fisch.) Bunge var. mongholicus (Bunge) P. K. Hsiao1. More than 100 compounds, including flavonoids, saponins, polysaccharides, and amino acids have been identified in AR. In addition, various biological activities of these compounds have been reported2,3,4. Traditionally, AR is used to treat weakness, wounds, anemia, fever, multiple allergies, chronic fatigue, loss of appetite, and uterine bleeding and prolapse5. Meanwhile, calycosin is the major bioactive isoflavonoid isolated from AR, and its potential pharmaceutical properties in the treatment of tumors, inflammation, stroke, and cardiovascular diseases have recently gained increasing attention6.

With the growing demand for AR, the raw materials for AR production are rapidly diminishing in China. Meanwhile, the cultivated Astragalus has become an important source of commercial AR in China7. A. membranaceus (Fisch.) Bunge var. mongholicus (Bunge) P. K. Hsiao is the most widely cultivated variety, although several other varieties have also been used as the raw material for commercial AR production. The inherent differences among these varieties might cause drug efficacy and safety issues. Unfortunately, the lack of molecular markers distinguishing the various varieties of A. membranaceus has hindered genetic diversity studies on A. membranaceus, and at least partly contributed to the gradual loss of some varieties. Thus, identification of molecular markers in AR is important not only in the screening of high-quality varieties of Astragalus but also in the conservation of wild Astragalus.

Previous studies suggested that chloroplast genome sequences, which have increasing phylogenetic resolution at lower taxonomic levels, are effective tools in plant phylogenetic and genetic population analyses8. The typical chloroplast genome in angiosperms has a conserved quadripartite structure, with two copies of an inverted repeat (IR) separating the large single copy (LSC) and small single copy (SSC) regions9. These genomes usually encode 120–130 genes with sizes in the range of 120–170 kb. The gene content and gene order are generally conserved, although a number of variations at the genome and gene levels among the chloroplast genomes in legumes have been reported. These variations include the loss of one copy of the IR10, the occurrence of inversions of 50 kb11 and 78 kb long12, the loss of the infA13, rpl22, and rps16 genes14, and the loss of introns, such as those in the rpl2, clpP, and rps12 genes14,15.

Here, we sequenced and annotated the complete chloroplast genome of A. membranaceus (Fisch.) Bunge var. mongholicus (Bunge) P. K. Hsiao as a first step to identify genetic markers that can distinguish the varieties of A. membranaceus. The genome is highly conserved in terms of genic and genomic structure compared with those of other Papilionoideae species. Comparative genomic analyses showed that this genome belongs to the inverted-repeat-lacking clade (IRLC). In addition, two inversions and numerous gene losses were identified. However, these inversions and/or gene loss events are probably associated with Papilionoideae as a whole, as we did not find any such events that can distinguish A. membranaceus from other Papilionoideae species. Most importantly, we identified five intraspecific hypermutation regions and 262 simple sequence repeat (SSR) loci. Three of the hypermutation loci are heteroplasmic. These hypermutation regions could be used as effective markers to study the genetic diversity among A. membranaceus varieties.

Results

General features of the A. membranaceus chloroplast genome

Unless specified, A. membranaceus refers to A. membranaceus (Fisch.) Bunge var. mongholicus (Bunge) P. K. Hsiao in this paper for simplicity. The chloroplast genome was completely sequenced by a combination of de novo assembly and gap filling, as described below. All raw sequence reads were mapped to the final assembly, and a total of 6,023,406 out of 15,000,362 (40.2%) paired-end reads were successfully mapped. The remaining unmapped reads possibly represent contaminant mitochondrial or nuclear DNAs (data not shown). The average coverage is approximately 9800-fold. The complete chloroplast genome sequence is 123,582 bp long with only one copy of the IR region. Moreover, a total of 110 genes were identified, including 76 protein-coding genes, 30 transfer RNA (tRNA) genes, and four ribosomal RNA (rRNA) genes (Table 1). The general structure and locations of the 110 genes in the chloroplast genome are depicted in Fig. 1. The LSC (bases 1–80986), SSC (bases 109810–123582) and IR (bases 80987–109809) regions are shown. The IR region is defined by the stop codons of genes rps19 and ycf1. Meanwhile, the genes rps16 and rpl22, which are found in most angiosperm plastid genomes including representatives of the early-branching lineages16,17,18, are absent in A. membranaceus. In addition, a total of 17 genes in the A. membranaceus chloroplast genome have only one intron (Table S1); ycf3 is the only gene with two introns. The introns in the 3′-end of rps12, a trans-spliced gene, are also absent. Moreover, the accD gene of A. membranaceus encodes a protein of 451 amino acids, which is shorter than the other accD proteins (Fig. S1). Furthermore, infA was not found in the chloroplast genome of A. membranaceus; this gene codes for translation initiation factor 1 and is suspected to be an example of chloroplast-to-nucleus gene transfer13. The implication of this finding needs further investigation.
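The region boundaries and the mapping rate quoted above can be checked with simple arithmetic. The following Python sketch uses only the figures stated in the text; the variable and function names are illustrative and are not part of the original analysis.

```python
# Region coordinates (1-based, inclusive) as stated in the text.
regions = {
    "LSC": (1, 80986),
    "IR": (80987, 109809),
    "SSC": (109810, 123582),
}
GENOME_LENGTH = 123582


def region_length(start, end):
    """Length of a 1-based inclusive interval."""
    return end - start + 1


lengths = {name: region_length(*span) for name, span in regions.items()}
total = sum(lengths.values())

# Percentage of paired-end reads mapped to the final assembly.
mapped, total_reads = 6_023_406, 15_000_362
mapping_rate = 100 * mapped / total_reads

print(lengths)
print("regions tile the genome:", total == GENOME_LENGTH)
print("mapping rate (%):", round(mapping_rate, 1))
```

The three regions are contiguous and together account for the full genome length, which is a quick consistency check on the reported coordinates.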

Table 1 Genes predicted in the chloroplast genome of A. membranaceus.
Figure 1: Schematic representation of the A. membranaceus chloroplast genome.

The predicted genes are shown, and colors represent functional classifications, which are listed at the bottom left. The genes drawn outside the circle are transcribed clockwise, whereas those drawn inside the circle are transcribed counter-clockwise. The inner circle shows the GC content. The large single copy (LSC), small single copy (SSC) and inverted repeat (IR) regions are shown in the inner circle. The three hypermutation regions (AB, BC and DE) are indicated with arrows.

Overall, 60.5% of the A. membranaceus chloroplast genome sequence consists of protein-coding genes. The overall GC content of the A. membranaceus chloroplast genome is 34.1%, whereas that of the protein-coding regions is 36.0%. Within the protein-coding regions, the GC contents at the first, second and third codon positions are 44.9%, 37.3% and 27.4%, respectively. The codon usage and codon-anticodon recognition pattern of the A. membranaceus chloroplast genome are summarized in Table S2. The 30 tRNA genes correspond to all 20 amino acids required for protein biosynthesis. Among these genes, six contain an intron, as follows: trnK-UUU, trnC-ACA, trnL-UAA, trnT-CGU, trnE-UUC, and trnA-UGC. The lengths of these introns range from 543 bp to 2494 bp.

Repeat and SSR analysis

SSRs are valuable molecular markers that show a high degree of variation within the same species and have been used in population genetics and polymorphism investigations19. We analyzed the occurrence, type, and distribution of SSRs in the A. membranaceus chloroplast genome and the distribution of SSRs in 13 other IRLC chloroplast genomes belonging to Papilionoideae. In total, 262 SSRs were identified in the A. membranaceus chloroplast genome (Table S3, Table 2). Among these SSRs, the majority consisted of mono- and dinucleotide repeats, which were found 148 and 89 times, respectively. Tri- (12), tetra- (11), and pentanucleotide (1) repeat sequences were found with much lower frequency. This pattern is similar to those observed in 13 IRLC chloroplast genomes of other species belonging to Papilionoideae (Table S4). Most mononucleotide repeat sequences consisted of A/T repeats (99.3%). Similarly, 86.5% of the dinucleotide repeat sequences consisted of AT/AT repeats (Table S3). Our findings agree with previous findings that chloroplast SSRs are generally composed of short polyA or polyT repeats and rarely contain tandem G or C repeats20. In this study, we also analyzed the locations of the 24 tri-, tetra- and pentanucleotide SSRs in the chloroplast genome, and the results are shown in Table 2. Among these SSRs, 21 are located in the intergenic regions, and 3 are in the coding regions.

Table 2 Distribution of tri-, tetra-, and penta- nucleotide SSR loci in the chloroplast genome of A. membranaceus.

Seven forward repeats were identified using REPuter with a size cutoff of 30 bp (Table 3). The longest forward repeat unit was 114 bp long and was located in the intergenic region of trnN-GUU and ycf1. Six tandem repeats longer than 30 bp were identified, and the similarities among these repeat units were >90%. All of these tandem repeats were located in the intergenic regions (Table 3).

Table 3 Repeat sequences identified in the chloroplast genome of A. membranaceus.

Presence of hypermutation regions in A. membranaceus chloroplast genome

The initial whole genome de novo assembly produced seven scaffolds, labeled A, B, C, D, E, F, and G. To close the gaps between them, we designed seven sets of primers spanning the adjacent scaffolds. PCR products were easily obtained using the primer pairs spanning the gaps between scaffolds A and B, B and C, as well as D and E (Fig. S2); however, sequencing of these three PCR products did not yield high-quality sequences. Manual examination of the trace files suggested the presence of multiple similar, but non-identical, sequences in these PCR products (Fig. S2). In particular, the sequence quality dropped markedly after the poly A/T stretches, which are located in the intergenic regions between the genes trnF-GAA and trnT-UGU (region AB, bases 14421–15192), psbK and trnQ-UUG (region BC, bases 53416–54021), and rpl33 and rps18 (region DE, bases 65175–65575). The start and end positions of these regions were determined by the 3′ ends of the corresponding PCR primers used for their amplification. These regions probably contained low complexity sequences of variable length.

To determine the exact structure of these polymorphic regions, DNA was extracted from four plant individuals, named i1, i5, i6 and i7. PCR amplification was performed and the PCR products were cloned. Ten positive clones for each PCR product were selected and sequenced. The sequences of all high-quality fragments were aligned with MegAlign (DNASTAR, WI) using the CLUSTALW2 algorithm (Fig. S3). Five variable loci, vl1, vl2, vl3, vl4 and vl5, are shown in Fig. 2A–E, respectively. The name of each sequence follows the format [name of genome region]-[id of plant individual]-[clone id]-[primer direction]. For the locus vl1 (Fig. 2A), an extra copy of the “TATATATTTA” repeat was found in i1, which was absent in i5 and i6. In i7, sequences from one out of three clones (AB-i7-c19) contained the extra copy of “TATATATTTA”. In contrast, the sequences from the other two clones, AB-i7-c13 and AB-i7-c18, did not have the extra copy. For the locus vl2 (Fig. 2B), we observed a single nucleotide insertion and a single nucleotide deletion in the sequences from clones AB-i6-c8 and AB-i7-c18, respectively, compared with the consensus sequence. This region is rich in “A”. For the locus vl3 (Fig. 2C), a single nucleotide deletion was observed in the sequences from one clone of i6 (BC-i6-c19). For the locus vl4 (Fig. 2D), an extra copy of “TATATTATA”, which is the repeat unit between genes rpl33 and rps18, was observed in all sequences of i1, i6 and i7 compared with those of i5. For the locus vl5 (Fig. 2E), there was an insertion of a single nucleotide “A” in the sequences from all clones of i7. All five loci are intraspecific variations. Among them, vl1, vl2 and vl3 are also heteroplasmic. These intraspecific loci represent markers that can potentially be used to distinguish closely related varieties of Astragalus membranaceus.
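The distinction between intraspecific and heteroplasmic variation drawn above can be expressed as a small Python sketch. It assumes that each clone sequence has already been reduced to an allele label at one locus; the function name, the allele labels, the primer directions and the clone IDs for i1, i5 and i6 are hypothetical. Only the three i7 clones and their states at vl1 come from the text.

```python
from collections import defaultdict


def classify_locus(alleles_by_sequence):
    """Classify one variable locus.

    alleles_by_sequence maps a sequence ID of the form
    region-individual-clone-direction to the allele seen in that sequence.
    A locus is intraspecific if more than one allele occurs overall, and
    heteroplasmic in an individual if that individual carries more than one.
    """
    per_individual = defaultdict(set)
    for seq_id, allele in alleles_by_sequence.items():
        _region, individual, _clone, _direction = seq_id.split("-")
        per_individual[individual].add(allele)
    all_alleles = set().union(*per_individual.values())
    heteroplasmic = sorted(
        ind for ind, alleles in per_individual.items() if len(alleles) > 1
    )
    return {
        "intraspecific": len(all_alleles) > 1,
        "heteroplasmic_individuals": heteroplasmic,
    }


# Toy encoding of locus vl1; clone IDs for i1, i5 and i6 are placeholders.
vl1 = {
    "AB-i1-c1-F": "extra_copy",
    "AB-i5-c1-F": "no_extra_copy",
    "AB-i6-c1-F": "no_extra_copy",
    "AB-i7-c19-F": "extra_copy",
    "AB-i7-c13-F": "no_extra_copy",
    "AB-i7-c18-F": "no_extra_copy",
}
print(classify_locus(vl1))
```

With this encoding, vl1 is flagged as intraspecific (alleles differ among individuals) and as heteroplasmic in i7 (alleles differ among clones of the same individual).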

Figure 2: Alignment of sequences from the PCR products for the identification of highly polymorphic regions in the A. membranaceus chloroplast genomes.

Panels (A,B) show the sequences obtained from the region AB. Panel (C) is for the region BC. Panels (D,E) show the sequences obtained from the region DE. The ID of each sequence is shown on the left side of each panel. The ID is the concatenation of region name, plant individual id, clone ID and primer direction (F: forward; R: reverse).

Phylogenetic analysis of A. membranaceus based on conserved protein sequences

To determine the phylogenetic position of A. membranaceus in Papilionoideae, 37 complete chloroplast genome sequences were obtained from the RefSeq database (Table S5). Nicotiana tabacum and Arabidopsis thaliana were included in the analysis as the outgroup taxa. The other 35 species belong to Cicereae (1), Dalbergieae (1), Fabeae (1), Galegeae (1), Indigofereae (1), Loteae (1), Millettieae (1), Robinieae (1), Genisteae (2), Trifolieae (9), and Phaseoleae (15). The number in parentheses is the number of species in the corresponding tribe. To conduct the phylogenetic analysis, we extracted 67 protein sequences that were present in all 38 chloroplast genomes. There were a total of 18515 positions in the final dataset. Results showed that A. membranaceus is the closest sister species of Glycyrrhiza glabra and Cicer arietinum with bootstrap values of 100% (Fig. 3). The symbols next to each species represent genes that were found lost. More details on gene losses are shown in Table 4. Overall, the patterns of gene loss are consistent with the tree topology, with a few exceptions. For example, ycf4 was found lost in V. unguiculata, but not in the closely related species V. angularis and V. radiata. In addition, ycf4 was found lost in T. boissieri, but not in the closely related T. grandiflorum and T. aureum. These findings suggest that the loss of ycf4 occurred after the emergence of the genera Vigna and Trifolium.

Figure 3: Molecular phylogenetic analysis of the Papilionoideae subfamily.

The tree was constructed with the sequences of 67 proteins present in all 38 species (Lupinus albus, Lupinus luteus, Robinia pseudoacacia, Lotus japonicus, Glycyrrhiza glabra, Astragalus membranaceus, Cicer arietinum, Pisum sativum, Lathyrus sativus, Trifolium repens, Trifolium meduseum, Trifolium subterraneum, Trifolium glanduliferum, Trifolium strictum, Trifolium aureum, Trifolium grandiflorum, Trifolium boissieri, Medicago truncatula, Indigofera tinctoria, Apios americana, Vigna angularis, Vigna radiata, Vigna unguiculata, Phaseolus vulgaris, Pachyrhizus erosus, Glycine max, Glycine soja, Glycine cyrtoloba, Glycine stenophita, Glycine tomentella, Glycine syndetika, Glycine canescens, Glycine dolichocarpa, Glycine falcata, Millettia pinnata, Arachis hypogaea, Arabidopsis thaliana, Nicotiana tabacum), using the Maximum Likelihood method implemented in RAxML. Two taxa, Nicotiana tabacum and Arabidopsis thaliana, were used as outgroups. The tribe to which each species belongs is shown to the right of the tree. Bootstrap supports were calculated from 1000 replicates. Genes lost in a particular branch are indicated with the following symbols: (rps16), ▲(ycf4), (accD), •(rpl23) and (rpl33).

Table 4 Gene losses in the chloroplast genomes of the Papilionoideae subfamily.

Frequent inversions in the chloroplast genomes of Papilionoideae

To identify the possible occurrence of genome rearrangement, the chloroplast genome sequences of A. membranaceus, N. tabacum and 12 other species belonging to Papilionoideae were selected for synteny analyses. These 12 species include C. arietinum, Arachis hypogaea, Lathyrus sativus, G. glabra, Lupinus luteus, Indigofera tinctoria, Lotus japonicus, Millettia pinnata, Glycine max, Robinia pseudoacacia, Medicago truncatula, and Trifolium aureum, which are members of the tribes Cicereae, Dalbergieae, Fabeae, Galegeae, Genisteae, Indigofereae, Loteae, Millettieae, Phaseoleae, Robinieae, and Trifolieae, respectively (Figs 4 and 5).

Figure 4: Synteny analyses of chloroplast genomes from A. membranaceus and N. tabacum.

(A) Global synteny view; LSC region, the IRa and IRb and SSC regions are shown at the bottom of the alignment. I, II, III, and IV represent the border regions of the two inversions (enlarged and shown below); (B) Detailed alignments of the border regions of two inversions between A. membranaceus and N. tabacum. The coding regions of genes are represented by lines below the synteny maps, with their names shown on top of the lines. Blue and red colors indicate that the genes are transcribed clockwise and counterclockwise, respectively. The genes lost in A. membranaceus are enclosed in parentheses.

Figure 5: Comparative genomic analyses of thirteen chloroplast genomes.

The chloroplast genome of A. membranaceus was aligned with those of twelve species. Each horizontal black line represents a genome. The species names are shown to the right of the corresponding line. The conserved regions are bridged by lines. The numbers on the right of each panel indicate the group to which the chloroplast genomes have been assigned.

Two inversions are readily discernible between the chloroplast genomes of N. tabacum and A. membranaceus (Fig. 4A). The genes at the enlarged inversion boundaries (I, II, III and IV) are shown in Fig. 4B. A large inversion of 50 kb, which is apparently shared by the majority of papilionoid legumes21, is located between the rps16 (Fig. 4B–I) and rbcL genes (Fig. 4B–II). The other notable inversion, of 20 kb, is located between the ndhF (Fig. 4B–III) and ycf1 genes (Fig. 4B–IV). This inversion has also been found in other species such as G. glabra, M. truncatula and C. arietinum (Fig. 5).

The 12 species were classified into seven groups based on the degree of genome conservation relative to the A. membranaceus chloroplast genome. The first group includes C. arietinum, M. truncatula, and G. glabra. The gene order of the chloroplast genomes of this group is highly conserved compared with that of A. membranaceus (Fig. 5A–C). In particular, these chloroplast genomes have only one copy of the IR. The second group includes G. max (Fig. 5D), whose chloroplast genome structure is similar to that of A. membranaceus, except for the presence of two copies of the IR. The third group includes A. hypogaea, L. japonicus, and I. tinctoria, whose genomes contain the 20 kb inversion in the SSC region and two copies of the IR (Fig. 5E–G). The fourth group includes M. pinnata, whose genome contains not only the 20 kb inversion in the SSC region but also one small inversion in the LSC region (Fig. 5H). The fifth group includes L. luteus, whose genome contains the 50 kb inversion in the LSC region (Fig. 5I). The sixth group includes R. pseudoacacia, whose genome includes two large inversions: one approximately 50 kb long in the LSC region and the other 20 kb long in the SSC region (Fig. 5J). All chloroplast genomes of the second to sixth groups have two copies of the IR. The seventh group includes two IRLC species, namely, L. sativus and T. aureum, whose chloroplast genomes contain numerous inversions (Fig. 5K,L). These results suggest that inversions occurred frequently in the evolution of Papilionoideae.

Comparative analyses of the gene losses among the chloroplast genomes in Papilionoideae

The loss of genes in the chloroplast genomes of Papilionoideae was then analyzed in detail (Table 4). The species names were ordered as shown in Fig. 4, and the gene names were ordered by the number of species in which the gene was found lost. The rpl22 gene was absent in all 36 chloroplast genomes of Papilionoideae. In addition, the rps16 gene was not found in the chloroplast genomes of 21 completely sequenced Papilionoideae species, including all IRLC species. Moreover, the loss of the ycf4 gene was observed in 16 chloroplast genomes. The loss of accD was observed in six Trifolium genomes. Loss of the rpl33 and rpl23 genes occurred in four chloroplast genomes (P. vulgaris, V. radiata, V. unguiculata, and V. angularis) and in two chloroplast genomes (P. sativum and L. sativus), respectively. The losses of ndhD, psaI, rps18, and rps19 were only found in V. angularis, L. sativus, T. subterraneum, and R. pseudoacacia, respectively. The most frequently lost genes rps16, ycf4 and rpl33 were found to be located at the boundaries of the 50 kb inversion, suggesting that their losses might be related to the genesis of this 50 kb inversion. The patterns of gene loss were found to be largely consistent with the topology of the phylogenetic tree (Fig. 3).

Discussion

In the present study, we have: (1) sequenced the chloroplast genome of A. membranaceus; (2) annotated the chloroplast genome; (3) identified the SSRs and tandem repeats of the genome; (4) carried out a phylogenetic analysis of the 38 chloroplast genomes based on 67 conserved proteins; (5) compared the structures of 13 chloroplast genomes in Papilionoideae; (6) identified genes that have been lost among the 36 chloroplast genomes in the Papilionoideae subfamily; and (7) identified five hypermutation loci that can potentially serve as markers to distinguish A. membranaceus varieties. Our results have laid the foundation for future studies on the evolution of chloroplast genomes of legumes, as well as the molecular identification of A. membranaceus varieties.

PCR products were readily obtained with primers spanning the targeted gaps during gap filling; however, obtaining sequencing results of good quality in three of these regions was difficult. After checking the trace files, we hypothesized that these regions largely contain low-complexity sequences and might be highly polymorphic. DNA samples from four plant individuals were extracted, the corresponding regions were amplified, and the PCR products were then cloned and sequenced. The results revealed five hypermutation loci, which contain variable repeat numbers or single nucleotide indels (Fig. 2). Furthermore, variations were also observed among sequences derived from the same plant individual (vl1, vl2 and vl3), a manifestation of heteroplasmy. This finding also explains why the de novo genome assembly program failed to assemble the genome at these regions in the first place.

Moreover, the current study demonstrated a high degree of diversity in the structure of legume chloroplast genomes. The genome organization and gene content of chloroplast genomes are believed to be highly conserved in most angiosperms22. With the increasing number of chloroplast genome sequences, the diverse organization of the chloroplast genome is becoming more evident, as demonstrated by the extensive genome rearrangements and gene losses in the chloroplast genomes of the legume family. For example, all members of the Carmichaelieae, Cicereae, Hedysareae, Trifolieae, Fabeae (Vicieae), and Galegeae tribes, and three genera of Millettieae, contain only one copy of the IR and are thereby assigned to the IRLC15. Furthermore, the losses of rpl22, rps16, and ycf4 have been reported in various chloroplast genomes15. These genomic rearrangements, combined with variations at the gene structural level, have provided valuable information to resolve relationships among several deep nodes of legumes21,23,24,25.

Whether or not there are any links among hypermutation, inversion and gene loss is an interesting question. Compared with that of N. tabacum, two large inversions have been identified in the A. membranaceus chloroplast genome (Fig. 4A). The 50 kb inversion was identified in the LSC region between rps16 and rbcL in N. tabacum (Fig. 4B), while rps16 was absent in the A. membranaceus chloroplast genome. A systematic analysis of gene losses in 36 other species (Table 4) showed that three of the most frequently lost genes, namely, accD, rps16, and ycf4, are located at the boundaries of the 50 kb inversion. Although one of them, ycf4, has not been lost in A. membranaceus, its loss has been found in Lathyrus odoratus and three other groups of legumes. In particular, each of the four consecutive genes ycf4-psaI-accD-rps16 has been lost in at least one member of the legume IRLC25. In contrast to the 50 kb inversion, gene losses were not observed at the boundaries of the 20 kb inversion. Hypermutation has been implicated in gene loss before. For example, a 1.5 kb long region of chloroplast DNA in plants related to sweetpea (Lathyrus) was found to coincide with ycf4; the local point mutation rate of this region is at least 20 times higher than elsewhere in the same molecule25. In A. membranaceus, the three hypermutation regions found are not adjacent to any of the inversions. Taken together, while the inversions and gene losses are likely to be associated based on their adjacency in A. membranaceus, the relationship between inversion and hypermutation is not evident.

In the future, we plan to apply the same approach to sequence and analyze more chloroplast genomes from A. membranaceus varieties. Comparative analyses will likely provide insight into the chloroplast genome evolution of A. membranaceus varieties. Furthermore, detailed characterization of the highly polymorphic regions is another interesting direction. Samples from individual plants belonging to different varieties of A. membranaceus can be collected. Primers specific to these regions can be used to amplify these regions for sequencing. Alignment of these sequences can be used to determine the degree of variations at the individual, population, variety, and species levels. This information will facilitate the establishment of an effective DNA barcoding-based identification method and provide valuable markers to study the population genetics of A. membranaceus.

Methods

Plant material and chloroplast DNA purification

Fresh leaves of A. membranaceus from multiple individuals were collected from the fields of the Institute of Medicinal Plant Development, Beijing, China and stored at 4 °C for chloroplast genomic DNA isolation. Chloroplasts were isolated from approximately 100 g of fresh leaves using the high salt saline plus Percoll gradient method described previously26. Subsequently, chloroplast DNA was extracted from the purified chloroplasts. The chloroplast DNA purity was evaluated on a 1.0% agarose gel, and the DNA concentration was measured using a Nanodrop spectrophotometer 2000 (Thermo Fisher Scientific, USA).

Chloroplast genome sequencing, assembly and gap filling

Approximately 50 ng of chloroplast DNA was sheared to yield approximately 500 bp long fragments for paired-end library construction according to the manufacturer’s instructions (Illumina Inc., San Diego, CA). The library was sequenced on Illumina HiSeq 2000 (Illumina Inc.). In total, 15,000,362 paired-end reads (2 × 100 bp) were obtained.

To identify a reference genome to assist the assembly, we first downloaded 27 chloroplast genomes belonging to the Papilionoideae from GenBank in December 2014. These chloroplast genome sequences were used to search against Illumina paired-end reads using BLASTN with an E-value cutoff of 1e-5. The genome sequence of G. glabra (Accession number: NC_024038) had the highest overall sequence similarity to the reads and was used as a reference for the downstream genome assembly.

ABySS (v1.5.2)27 was used for the de novo genome assembly. Different k-mer sizes were tested. A k-mer size of 64 gave the best results in terms of the smallest number of scaffolds and the longest average scaffold length, and this parameter was used to generate the final assembly.
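The selection criterion described above can be sketched in Python as follows. The text does not state how the two criteria were combined, so this sketch assumes that the number of scaffolds is compared first and the mean scaffold length breaks ties. The function name is hypothetical, and the scaffold lengths are made-up numbers for illustration only, not results from this study.

```python
from statistics import mean


def pick_kmer(assemblies):
    """Return the k-mer size with the fewest scaffolds.

    assemblies maps a k-mer size to the list of scaffold lengths obtained
    with that size. Ties in scaffold count are broken by the longest mean
    scaffold length (an assumption; see the lead-in).
    """
    return min(
        assemblies,
        key=lambda k: (len(assemblies[k]), -mean(assemblies[k])),
    )


# Made-up scaffold lengths for illustration only.
toy_assemblies = {
    32: [9000, 8000, 7000, 6000, 5000, 4000, 3000, 2000, 1000],
    48: [20000, 15000, 10000, 9000, 8000, 7000, 6000, 5000],
    64: [30000, 25000, 20000, 15000, 10000, 8000, 7000],
}
best_k = pick_kmer(toy_assemblies)
print(best_k)
```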

The resulting contigs were compared against the chloroplast genome sequence of G. glabra using BLASTN with an E-value cutoff of 1e-5. Seven large contigs were identified and were provisionally ordered based on their mapping positions on the reference genome. Primers were then designed based on the sequences at the ends of adjacent contigs, and PCR amplification and subsequent DNA sequencing were used to fill the gaps. PCR amplifications were performed using the sequence-specific primers (Table S6) under the following conditions: predenaturation at 94 °C for 2 min, 35 cycles of amplification at 94 °C for 30 s, 55 °C for 30 s and 72 °C for 30 s, followed by a final extension at 72 °C for 2 min. The PCR reaction mixture contained 25 μl of Taq MasterMix (2×), 2 μl of forward primer (10 μM), 2 μl of reverse primer (10 μM), and purified chloroplast DNA (<1 μg). RNase-free water was added to a final reaction volume of 50 μl.

The correctness of the assembly was further validated by mapping all raw sequence reads to the assembly using the Bowtie 2 (v2.0.1) program28 with the default settings. Manual examination of the coverage of the entire assembly was performed using Tablet (v1.14.10.20)29. The primer sequences are listed in Table S6.

Genome annotation and codon usage analyses

The CpGAVAS web service30 was used to annotate the A. membranaceus chloroplast genome. Cutoffs for the E-values of BLASTN and BLASTX were 1e-10. The number of top hits to be included in the reference gene sets for annotation after the pre-filtering step was 10. Meanwhile, tRNA genes were identified using tRNAscan-SE31 and ARAGORN32. Manual corrections on the positions of the start and stop codons, and for the intron/exon boundaries were performed based on the entries in the Chloroplast Genome Database33 using the Apollo program34. Moreover, the circular chloroplast genome map of A. membranaceus was drawn using OrganellarGenomeDRAW35. Furthermore, codon usage and GC content were analyzed using the Cusp and Compseq programs provided by EMBOSS36. Final genome assembly and genome annotation results were deposited in the GenBank (accession number: KU666554).

Repeat sequence analysis

SSRs were detected using MISA Perl Script available at (http://pgrc.ipk-gatersleben.de/misa/), with the following thresholds: 8 repeat units for mononucleotide SSRs, 4 repeat units for di- and trinucleotide repeat SSRs, and 3 repeat units for tetra-, penta-, and hexanucleotide repeat SSRs. Tandem repeats were analyzed using Tandem Repeats Finder37 with parameter settings of 2 for matches and 7 for mismatches and indels. The minimum alignment score and maximum period size were set at 50 and 500, respectively. All the identified repeats were manually verified and nested or redundant results were removed. REPuter38 was employed to identify the IRs in A. membranaceus by forward vs. reverse complement (palindromic) alignment. The minimal repeat size was set at 30 bp, and the cutoff for similarities among the repeat units was set at 90%.
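The repeat-unit thresholds listed above can be illustrated with a regular-expression scan in Python. This is a minimal sketch, not a reimplementation of the MISA script: the function names are hypothetical, only perfect repeats are reported, and motifs that are themselves repeats of a shorter unit (for example "AA" or "ATAT") are skipped to avoid counting the same run twice, which is an assumption of this sketch.

```python
import re

# Minimum number of repeat units per motif length, as stated in the text.
MIN_REPEATS = {1: 8, 2: 4, 3: 4, 4: 3, 5: 3, 6: 3}


def is_redundant(motif):
    """True if the motif is itself a repeat of a shorter unit, e.g. 'AA' or 'ATAT'."""
    n = len(motif)
    return any(n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n))


def find_ssrs(seq):
    """Return sorted (start, end, motif, repeat count) tuples with 1-based inclusive coordinates."""
    seq = seq.upper()
    hits = []
    for size, min_rep in MIN_REPEATS.items():
        # One motif of the given size followed by at least min_rep - 1 exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if is_redundant(motif):
                continue
            repeats = len(match.group(0)) // size
            hits.append((match.start() + 1, match.end(), motif, repeats))
    return sorted(hits)


# Toy sequence (hypothetical): an A10 run and an (AT)5 run separated by spacers.
toy = "GC" + "A" * 10 + "GCGC" + "AT" * 5 + "CCG"
for hit in find_ssrs(toy):
    print(hit)
```

On the toy sequence, the scan reports one mononucleotide and one dinucleotide SSR; runs shorter than the thresholds, such as the "GCGC" spacer, are ignored.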

Phylogenetic analysis

A total of 37 complete chloroplast DNA sequences belonging to the Papilionoideae subfamily were obtained from the RefSeq database (Table S5). For the phylogenetic analysis, 67 protein sequences shared among all these 37 species and A. membranaceus were aligned using the CLUSTALW2 (v2.0.12) program. The 67 proteins are ATPA, ATPB, ATPE, ATPF, ATPH, ATPI, CCSA, CEMA, CLPP, MATK, NDHA, NDHB, NDHC, NDHE, NDHF, NDHG, NDHH, NDHI, NDHJ, NDHK, PETA, PETB, PETD, PETG, PETL, PETN, PSAA, PSAB, PSAC, PSAJ, PSBA, PSBB, PSBC, PSBD, PSBE, PSBF, PSBH, PSBI, PSBJ, PSBK, PSBL, PSBM, PSBN, PSBT, PSBZ, RBCL, RPL14, RPL16, RPL2, RPL20, RPL36, RPOA, RPOB, RPOC1, RPOC2, RPS11, RPS12, RPS14, RPS15, RPS2, RPS3, RPS4, RPS7, RPS8, YCF1, YCF2 and YCF3 (Supplementary file 1). The alignment was manually examined and adjusted. Then, the evolutionary history was inferred using the Maximum Likelihood method implemented in RAxML (v8.2.4)39. The detailed parameters were “raxmlHPC-PTHREADS-SSE3 -f a -N 1000 -m PROTGAMMACPREV -x 551314260 -p 551314260 -o A_thaliana,N_tabacum -T 20”. The tree with the highest log likelihood (−233993.753326) is shown. The significance level for the phylogenetic tree was assessed by bootstrap testing with 1000 replications. Only branches supported by bootstrap values >50% are shown.
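The data preparation for such an analysis can be sketched in Python as two steps: finding the proteins present in every genome, and joining the per-protein alignments into one dataset. The text does not state how the 67 alignments were combined, so the concatenation step is an assumption of this sketch; the function names and toy sequences are hypothetical.

```python
def shared_proteins(proteomes):
    """Names of the proteins present in every species.

    proteomes maps a species name to {protein name: sequence}.
    """
    name_sets = [set(proteins) for proteins in proteomes.values()]
    return sorted(set.intersection(*name_sets)) if name_sets else []


def concatenate(alignments, protein_order):
    """Join per-protein alignments into one row per species.

    alignments maps a protein name to {species: aligned sequence}; all
    sequences of one protein must have the same length.
    """
    species = sorted(alignments[protein_order[0]])
    matrix = {sp: "" for sp in species}
    for protein in protein_order:
        block = alignments[protein]
        if len({len(seq) for seq in block.values()}) != 1:
            raise ValueError(f"{protein}: sequences are not aligned")
        for sp in species:
            matrix[sp] += block[sp]
    return matrix


# Toy data (hypothetical): YCF4 is missing from sp2 and is therefore excluded.
proteomes = {
    "sp1": {"ATPA": "MK", "RBCL": "MSP", "YCF4": "M"},
    "sp2": {"ATPA": "MRA", "RBCL": "MSP"},
}
shared = shared_proteins(proteomes)
alignments = {
    "ATPA": {"sp1": "MK-", "sp2": "MRA"},
    "RBCL": {"sp1": "MSP", "sp2": "MSP"},
}
supermatrix = concatenate(alignments, shared)
print(shared, supermatrix)
```

Restricting the dataset to proteins shared by all genomes avoids missing data in the combined alignment, at the cost of excluding genes that have been lost in some lineages.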

Comparative genome analysis

Conserved sequences were identified between the chloroplast genomes of A. membranaceus and those of N. tabacum (NC_001879), C. arietinum (NC_011163), A. hypogaea (NC_026676), L. sativus (NC_014063), G. glabra (NC_024038), L. albus (NC_023090), I. tinctoria (NC_026680), L. japonicus (NC_002694), M. pinnata (NC_016708), G. max (NC_007942), R. pseudoacacia (NC_026684), M. truncatula (NC_003119), and T. aureum (NC_024035) using BLASTN with an E-value cutoff of 1e-10. The homologous regions and gene annotations were visualized using the web-based genome synteny viewer GSV40.

Examination of hypermutation regions in A. membranaceus chloroplast genome by PCR amplification, PCR product cloning and DNA sequencing

To determine the structure of the likely hypermutation regions in the A. membranaceus chloroplast genome, the total DNA of four A. membranaceus individuals was extracted independently using the plant genomic DNA kit (Tiangen Biotech, Beijing) and subjected to PCR amplification using the PrimeSTAR Max DNA polymerase (Takara Bio, Japan), a high-fidelity polymerase. The primers specific for the gaps between scaffolds A and B, B and C, as well as D and E were used (Table S6). The PCR reactions were performed under the following conditions: pre-denaturation at 95 °C for 1 min, 40 cycles of amplification at 98 °C for 10 s, 53 °C for 15 s and 72 °C for 10 s, followed by a final extension at 72 °C for 2 min. The PCR products were purified with the TIANquick Midi Purification Kit (Tiangen Biotech) and cloned using the Lethal Based Fast Cloning Kit (Tiangen Biotech). For each region from each individual plant, 10 positive clones were selected and sequenced by the Sanger method. A total of 120 clones were sequenced in both the forward and reverse directions by Sinogenomax Co., Ltd (Beijing).

Additional Information

How to cite this article: Lei, W. et al. Intraspecific and heteroplasmic variations, gene losses and inversions in the chloroplast genome of Astragalus membranaceus. Sci. Rep. 6, 21669; doi: 10.1038/srep21669 (2016).