Comparison of eight complete chloroplast genomes of the endangered Aquilaria tree species (Thymelaeaceae) and their phylogenetic relationships

Aquilaria tree species are naturally distributed in the Indomalesian region and are protected against over-exploitation. They produce a fragrant non-timber product of high economic value, agarwood. Ambiguous species delimitation and limited genetic information within Aquilaria are among the impediments to conservation efforts. In this study, we conducted comparative analysis on eight Aquilaria species complete chloroplast (cp) genomes, of which seven were newly sequenced using Illumina HiSeq X Ten platform followed by de novo assembly. Aquilaria cp genomes possess a typical quadripartite structure including gene order and genomic structure. The length of each of the cp genome is about 174 kbp and encoded between 89 and 92 proteins, 38 tRNAs, and 8 rRNAs, with 27 duplicated in the IR (inverted repeat) region. Besides, 832 repeats (forward, reverse, palindrome and complement repeats) and nine highly variable regions were also identified. The phylogenetic analysis suggests that the topology structure of Aquilaria cp genomes were well presented with strong support values based on the cp genomes data set and matches their geographic distribution pattern. In summary, the complete cp genomes will facilitate development of species-specific molecular tools to discriminate Aquilaria species and resolve the evolutionary relationships of members of the Thymelaeaceae family.

Scientific RepoRtS | (2020) 10:13034 | https://doi.org/10.1038/s41598-020-70030-0 www.nature.com/scientificreports/ Flora (CITES) has listed all Aquilaria species under Appendix II, as one of the countermeasures to reduce illegal agarwood trade 2,4 . Agarwood is chiefly traded in wood form or consumer products, yet Aquilaria species delimitation is based on the plant's botanical characteristics, with much emphasis given to the reproductive parts specifically flower and fruit 5 . Incomplete distinct morphological characteristics render genetic detection tools indispensable during agarwood identification process. Recent approaches have utilized short DNA sequences to identify Aquilaria species and species-origin of agarwood [6][7][8][9] . Short cp gene sequences have been adequate in barcoding some land plants 10 , unfortunately it has been shown that with Aquilaria, highly evolved DNA barcodes are required. Several DNA barcoding loci have been tested in Aquilaria, yet the discrimination power across a majority of the Aquilaria species is insufficient 7 .
A recent phylogenetic analysis utilizing five non-coding cp DNA regions from 15 species (Aquilarieae) yielded inconclusive resolution due to the high percentage of conserved sites 11 . Meanwhile, in the same study, nuclear ribosomal DNA internal transcribed spacer (ITS) revealed a paraphyletic relationship between Aquilaria species from Indochina and Malesian. It has been suggested that genome-scale data could provide a better resolution and thorough information of a studied genus pertaining to taxonomical aspects, genetic diversity, and pattern of evolution 11 .
Chloroplast is an important organelle contributing to the growth development of most plants. They play important roles in photosynthesis and carbon fixation 12 . Chloroplast or plastid originally derived from a freeliving photosynthetic prokaryote. Present-day chloroplast arose from a cynobacterial endosymbiont 13 . Chloroplasts still exhibit many prokaryotic characteristics such as having a circular DNA, reproducing in a similar way as bacteria through division, and importing nuclear encoded proteins through thylakoids 14 . In general, the cp genomes in angiosperms are circular DNA molecules with a highly conserved region, gene content, and gene order, and the standard cp genome size is ranged between 120 and 170 kbp in length 15 . A typical cp genome consists of a pair of inverted repeats (IR) region that is separated by a large single copy (LSC) region and a small single copy (SSC) region 16 . The advancement of high-throughput sequencing technology has amassed thousands of complete cp genomes from various land plants in just several years 17 . A total of 23,867 complete cp genomes of land plants have been sequenced (NCBI, https ://www.ncbi.nlm.nih.gov/genom e/organ elle/, accessed on 12 February 2020). The maternal inheritance in the cp genome has provided an exclusive and substantial information for plant systematics and evolutionary relationships 18 . Cp genomes have been used in species identification, phylogeny and population genetic analyses 17,19,20 . Potential markers can be developed through cp genomes analysis for identification of plant species, particularly for the taxonomically complicated groups 10,21 .
Due to the small genome size, high interspecific and low intraspecific divergence, and ease of handling, cp genome is an attractive alternative to provide more variations in discriminating closely related plants 10 . To date, research on Aquilaria genomics has yielded a draft whole genome of A. agallocha 22 and complete sequences of six cp genomes, one each from A. crassna 23 , A. malaccensis 24 , and A. yunnanensis 25 , and three from A. sinensis 18,26,27 . In this study, we report complete cp genome sequences of seven Aquilaria species and incorporate these new sequences together with another published cp genome from our group (A. malaccensis) 24 in all our analyses. In addition, we also completely sequenced the cp genomes of Gonystylus affinis and Phaleria macrocarpa and retrieved complete cp sequences of another six species, all of which are under the Thymelaeaceae family, to determine their molecular placement within the phylogenetic tree. Collectively, we provide a rich genomic resource to better understand Aquilaria, which may help facilitate the conservation efforts of these endangered species.

Materials and methods
Sample materials. Fresh leaf samples were collected from individual trees growing in the greenhouse of the Faculty of Forestry and Environment, Universiti Putra Malaysia (UPM), Serdang, Selangor, Malaysia, and the Aquilaria germplasm of Forest Research Institute of Malaysia (FRIM), Kepong, Selangor, Malaysia. For comparative analysis, seven species, A. beccariana, A. hirta, A. microcarpa, A. rostrata, A. crassna, A. sinensis, and A. subintegra, were sequenced. The four former species are native to Malaysia, while the following three are introduced plantation species in the country. For phylogenetic analysis, the cp genome of G. affinis and P. macrocarpa, two close relatives of Aquilaria, were also sequenced.
DnA extraction and sequencing. A total of 100 mg fresh leaves was pulverized into powder using mortar and pestle, with the aid of liquid nitrogen. Total genomic DNA was extracted using a modified cetyltrimethylammonium bromide (CTAB) method 28 . The quantity and quality of the DNA samples were determined using the Qubit dsDNA BR assay (Life Technologies, Carlsbad, CA, USA) using the manufacturer's instructions. DNA samples were fragmented using sonication, purified and end-repaired, and their sizes were determined by gel electrophoresis and the size of fragments were between 200 to 500 bp. A genomic library with an insert size of 300 bp was prepared using TruSeq DNA Sample Prep Kit (Illumina, CA, USA) and next-generation sequencing was conducted on a HiSeq X Ten platform (Illumina, USA). chloroplast genome assembly and annotation. Approximately 8 Gb of raw data that consisted of 150-bp paired-end reads were generated and the sequence adaptors for the raw reads were trimmed off using the base quality control software NGS QC Toolkit v2.3.3 29 . The cp genome was assembled using NOVOPlasty v3.8.2 30 , with the rbcL sequence of A. yunnanensis (KR528756) as the seed sequence. The assembled cp genome sequence was annotated using online annotation tool GeSeq 31 , and further compared manually against A. yunnanensis cp genome (MG656407). The circular cp genome maps were visualized using OGDRAW v1.3.1 32 .  34 to identify nucleotide diversity in the total genome, LSC, SSC and IR regions. The boundaries between the IR and SC regions were further evaluated manually to examine the differences in length variation in the cp genomes of Aquilaria.
Repeat structure analysis and identification of highly variable regions. Repeat sequences as well as forward (F), reverse (R), complement (C) and palindrome (P) sequences were identified using REPuter 35 , with the maximum and minimum repeat size set at 50 and 30, respectively, and Hamming distance ≤ 3. To identify highly variable regions, polymorphic sites and nucleotide variability (Pi) in the eight MAFFT aligned cp genomes were evaluated using a sliding window analysis available in DnaSP v5.10.01, under a 200-bp step size and a 600-bp window length. The regions that contain the number of polymorphic sites that are more than the sum of the average and double the standard deviation are regarded as highly variable regions in the cp genome 34 39 and Eucalyptus grandis (HM347959) 40 were used as outgroups. Sequences were aligned using MAFFT v7 33 with default settings (strategy of FFT-NS-2). Phylogenetic analyses were subsequently performed using Maximum likelihood (ML) and Bayesian inference (BI) methods. Maximum likelihood (ML) analyses were performed using IQ-TREE v.1.4.2 41 with branch support estimated using 2,000 replicates of both SH-like approximate likelihood-ratio test (SH-aLRT) 42 and the ultrafast bootstrapping algorithm (UFboot) 43 . The Mod-elFinder option was used to identify the optimal partitioning scheme and substitution models 44 , in which the DNA substitution model that is most suitable for our dataset was transversion model (TVM) with empirical base frequencies (+ F) and discrete Gamma model with default 4 rate categories (+ G4) (= TVM + F + G4). The phylogenetic tree was rooted using E. grandis and visualized using Figtree v1.4.4 45 . Bayesian inference (BI) analyses were performed using the program MrBayes v3.2.7 46 . Markov chain Monte Carlo (MCMC) simulations were run twice independently for 2 million generations, and sampling trees every 100 generations. Convergence was determined by examining the average standard deviation of split frequencies (≤ 0.01). The first 25% of trees was discarded as burn-in, and the remaining trees were used to build a majority-rule consensus tree.

Results and discussion
chloroplast genome sequencing. Approximately 60,000,000 raw reads were obtained for each species sequenced using the HiSeq X Ten system. Raw reads were inserted directly into the pipeline without filtering or quality trimming to obtain maximum useful data. To accelerate the assembly of plastid genomes, we selected only the first 13.6 million sequences of each paired-end data, yielding with a total of 15.  Table 1). All eight Aquilaria cp genomes share a typical quadripartite structure composed of a pair of IRs known as IR A and IR B , and a single LSC and SSC (Fig. 1). In addition, the gene content and order are highly similar. This agrees with the consensus that the genomic structure in cp genomes of angiosperms is highly conserved 15,17 Table 2). The GC content of the Aquilaria species in our study is similar to that reported in A. yunnanensis (38%) 25 and A. sinensis (36.7%) 26 . The contraction and expansion of IR region boundaries are considered the primary mechanism that affects the varying lengths in angiosperm cp genomes, as demonstrated in Apiales 49 and Trochodendraceae 50 . However, in this study, variations were in fact detected at the LSC/IR A , IR A /SSC, SSC/ IR B , and IR B /LSC border regions of the Aquilaria cp genomes (Fig. 3). When comparing the boundary (IR/SC) regions between Aquilaria species and two of their Thymelaeaceae relatives, S. chamaejasme (MK681211) and D. kiusiana (KY991380), they all share highly identical genes at the border junctions. The number of encoded functional genes from the species we sequenced ranged from 135 to 138 (Table 1), which are not significantly different from the 137 reported in A. sinensis 26 (Table 2). Presence or absence of a specific gene from several Aquilaria species and not others could be due to the gene being transferred to the nucleus. Gene transfer events have been observed in gene knockout experiments, such as the rps18 in tobacco 52 . This event most likely happened when plants are exposed to biotic and abiotic stresses, which lead to the inducing accumulation of reactive oxygen species (ROS), which activates signalling pathway when at low levels, but can cause irreparable injury to cells when produced excessively 53,54 . ROS is a normal product of plant cellular metabolism that can be affected by various types of stress 55 . ROS generated in chloroplasts can also act as signals that travel from the cp to the nucleus under stress conditions 56 . Since signals are moving from the chloroplast to the nucleus under stress condition, these transfers may also promote the transfer of chloroplast genome fragments to the nucleus where they could be incorporated into the nuclear genome 54 . Consequently, it may assist the transfer of cp gene to the nucleus 54 . When comparing the eight cp genomes for base/nucleotide composition, in LSC, A. malaccensis has the highest percentage of A (34%) and G (18%) nucleotides, but the lowest percentage in T (32%) and C (16%) nucleotides (Fig. 2). In SSC, A. sinensis has the highest percentage of A (41%) nucleotide, while A. crassna and A. hirta have the highest percentage of T (32%) and C (16%) nucleotides, respectively. The overall A + T content is more than 50% when compared to G + C content (Fig. 2). This study shows that Aquilaria cp genomes have high levels of A + T content, a feature generally observed in many cp genomes sequences of angiosperm species 57 . interspecies chloroplast genome sequence analysis. Multiple cp genome sequence alignment of the eight Aquilaria cp genomes with a total of 174,832 nucleotide sites revealed 697 variable (polymorphic) sites including 405 singleton variable sites (SVS) and 292 parsimony informative sites (PIS) ( Table 3). There are two different categories under SVS, 403 sites with two variants (SV2V) and two sites with three variants (SV3V). Similarly, the PIS also has two variants (PIS2V) (288 sites) and three variants (PIS3V) (4 sites) ( Table 3 (Table 4). In the PIS2V category, A. hirta has the highest number of PIS for nucleotide A (81), while A. crassna for nucleotide T (90). In summary, most of the variable sites were identified in A. hirta, A. rostrata and A. sinensis ( Table 4). The information on SVS and PIS are useful for species identification studies and for determining phylogenetic relationships 58,59 . iR contraction and expansion. Close examination of the IR/SC boundary regions among the eight Aquilaria species revealed three main differences (Fig. 3). Firstly, the rps19 gene (284 bp) is extended beyond the LSC into the IR A region by 15 bp in all species. Secondly, the ndhf gene spans the IR A /SSC border, between 25 to 28 bp in the IR A region and 2,211 bp in the SSC region, except in A. microcarpa, where it is completely in the SSC region, distanced by 6 bp from the IR A region. Thirdly, in all the eight Aquilaria species, the rpl32 and trnL genes are in the SSC region and IR B region, respectively, however with slight differences in the distance to or from the ISSC/R B border. No differences were observed in the IR B /LSC border region; the rpl2 gene is in the IR B region Table 1. Summary of the assembly data of eight Aquilaria chloroplast genomes. Data were extracted from the complete chloroplast genomes sequenced in this study and the available A. malaccensis sequences (GenBank accession number MH286934).

Aquilaria beccariana
Aquilaria www.nature.com/scientificreports/ and the trnH is in the LSC region. In general, the IR region is one of the main reasons for a change in the size of the cp genome due to expansion, shrinkage and loss of the IR 60 .
Large sequence repeat analyses. The large sequence repeat (LSR) of the eight Aquilaria cp genomes were analyzed using REPuter software. A total of 832 repeats (at least 30 bp per repeat unit with Hamming distance = 3), including forward (F), reverse (R), palindromic (P) and complement (C) repeats were identified ( Table 5, and Supplementary Tables 1 to 31). In general, F repeats are the most common type detected in the Aquilaria cp genomes, while C repeats are the least. Among the eight species, C repeats are absent from A. rostrata, although it has 50 F and 2 R repeats, and 48 P repeats (Fig. 4). Large repeat sequences are informative for phylogenetic studies of Aquilaria species as they play important roles in cp genome evolution and may aid in future development of molecular markers 61 .   www.nature.com/scientificreports/ Frequent variation in repeat regions in most angiosperm plants occurs due to slipped-strand mispairing and illegitimate recombination. Frequent variation in repeat regions also plays an important role in variation and sequence rearrangement in cp genomes 20,62 . In addition, the quantity of the identified repeats is sensitive to the Hamming distance used. For example, when we cut the Hamming distance from 3 to 1 (in other words rigidity was augmented), the number of repeat sequences was lowered ( Table 5).

Identification of highly variable regions within the Aquilaria cp genomes.
Using the alignment created by MAFFT and DnaSP software, the nucleotide variability (Pi) values within 600 bp window were calculated in all eight cp genomes. They are in the range from 0 to 0.01370 (Fig. 5). There are nine highly divergent regions (Pi > 0.005), divided between the intergenic spacer (IGS) region (trnD-trnY, trnT-trnL, trnL-trnF, trnF-ndhJ, trnV-trnM) and the coding sequence (CDS) regions (matK-rps16, rpoC1-rpoC2, petA-cemA and rpl32) (Fig. 5). In total, there are 144 variable sites, 72 parsimony informative sites and Pi values from 0.00630 to 0.01370, in the nine regions (Table 6). Among these, rpl32 has the most nucleotide variation (0.01370). Meanwhile, we found that the IR region is extremely conserved (Pi < 0.005) because highly variable region/divergent sequences were not found.  Fig. 6. Similar phylogenetic topologies structures were found in the ML and BI nucleic acid analyses. The nine Aquilaria species are diverged into two major clades (Clade 1 and 2) showing a paraphyletic relationship, with a strong support as indicated from the high bootstrap values for SH-aLRT and UFBoot and posterior probability values (100%, 100%, and 1, respectively) (Fig. 6A). Clade 1 has three species of Malay Peninsula origin (A. hirta, A. beccariana and A. malaccensis) and one species of Borneo origin (A. microcarpa). Table 3. Variable site analysis shows the presence of singleton variable sites (SV) and parsimony informative sites (PIS) in the eight Aquilaria chloroplast genomes. SV2V singleton variable sites with two variants, SV3V singleton variable sites with three variants, PIS2V parsimony informative sites with two variants, and PIS3V parsimony informative sites with three variants.     www.nature.com/scientificreports/ (Fig. 6A). We conclude that the phylogenetic positions within the Aquilaria species reported here corresponds well with their natural geographic distribution pattern. When compared to the recent Aquilaria phylogenetic tree constructed using a concatenated dataset of five cp gene sequence (matK, rbcL, trnL intron, trnL-trnF, and psbC-trnS) 11 , a consistent clustering pattern was observed. However, our cp genome-ML tree has a higher statistical support. This shows that comparative analysis using complete cp genome reveals greater abundance in informative characters when compared to the short gene fragments in Aquilaria.   www.nature.com/scientificreports/ Figure 6B exhibits Aquilaria's position in relation to other member taxa in the family Thymelaeaceae. All Aquilaria species clustered into a strongly supported clade (100% ,100%, and 1) after G. affinis. Aquilaria is the first to diverge from Gonystylus followed by Phaleria and Daphne (Fig. 6B). Both latter taxa are under the Daphneae tribe, placed as sister to the Aquilarieae tribe in the subfamily Thymelaeoideae. Our findings are in agreement with the classification system by Herber (2003) 66 , who proposed two major subfamilies for Thymelaeaceae, Octolepidoideae (Gonystylus) and Thymelaeoideae (s.l.). The latter subfamily is further divided into tribes Aquilarieae (Aquilaria), Daphneae (Phaleria, Daphne, and Stellera) and Synandrodaphneae. Similarly, www.nature.com/scientificreports/ our results compliment the classification system of Domke (1934) 67 , as shown by Beaumont et al. 68 through a phylogenetic analysis involving 143 specimens from members of the Thymelaeaceae family and the combined dataset of rbcL + trnL-trnF + ITS. Aquilaria, placed under the subfamily Aquilarioideae, is shown to evolve after the Gonystyloideae (Gonystylus), and sister to Thymelaeoideae (Phaleria, Daphne, and Stellera) 66 .

conclusion
In this study, we report new complete cp genomes sequences from seven Aquilaria species and analyzed these genomes including another, which we recently published. The eight Aquilaria cp genomes were similar in genome content, structure, and gene order. Comparison of the eight Aquilaria cp genomes revealed 832 LSR and nine divergent regions (trnD-trnY, trnT-trnL, trnL-trnF, trnF-ndhJ, trnV-trnM, matK-rps16, rpoC1-rpoC2, petA-cemA and rpl32). Both ML and BI phylogenetic analyses strongly supported the phylogenetic positions within the Aquilaria species and their natural geographic distribution pattern. We have successfully revealed the complete cp genome sequences for eight Aquilaria species, in which five were native to Malaysia. Future studies should identify potential molecular markers to provide a clear discrimination between these important and closely related genetic resources.