A complete sequence of mitochondrial genome of Neolamarckia cadamba and its use for systematic analysis

Neolamarckia cadamba is an important tropical and subtropical tree for timber industry in southern China and is also a medicinal plant because of the secondary product cadambine. N. cadamba belongs to Rubiaceae family and its taxonomic relationships with other species are not fully evaluated based on genome sequences. Here, we report the complete sequences of mitochondrial genome of N. cadamba, which is 414,980 bp in length and successfully assembled in two genome circles (109,836 bp and 305,144 bp). The mtDNA harbors 83 genes in total, including 40 protein-coding genes (PCGs), 31 transfer RNA genes, 6 ribosomal RNA genes, and 6 other genes. The base composition of the whole genome is estimated as 27.26% for base A, 22.63% for C, 22.53% for G, and 27.56% for T, with the A + T content of 54.82% (54.45% in the small circle and 54.79% in the large circle). Repetitive sequences account for ~ 0.14% of the whole genome. A maximum likelihood (ML) tree based on DNA sequences of 24 PCGs supports that N. cadamba belongs to order Gentianales. A ML tree based on rps3 gene of 60 species in family Rubiaceae shows that N. cadamba is more related to Cephalanthus accidentalis and Hymenodictyon parvifolium and belongs to the Cinchonoideae subfamily. The result indicates that N. cadamba is genetically distant from the species and genera of Rubiaceae in systematic position. As the first sequence of mitochondrial genome of N. cadamba, it will provide a useful resource to investigate genetic variation and develop molecular markers for genetic breeding in the future.

The mitochondrial genome 2 is 305,144 bp in length (GenBank Access No. MT364442). The genome 2 contains 69 genes (Table 2), including 33 PCGs, 26 transfer RNA genes, 6 ribosomal RNA genes, and 4 other genes (ccmC, mttB, ccmB, ccmC). The PCGs are 8 NADH dehydrogenase genes (nad4L, nad2, nad3, nad5, nad6, nad3, nad9, nad4), 1 succinate dehydrogenase genes (sdh4), 2 ubichinol cytochrome reductase genes, 4 cytochrome c oxidase genes, 5 ATP synthase genes (atp4, atp1, atp9, atp1, atp8), 10 ribosomal proteins genes (6 rps, 4 rpl), and 3 ORFs. The 15 PCGs (nad4L, nad2, nad3, nad9, atp4, atp1, cob, cox2, rpl10, rpl5, rps13, rps12, rps4, orf954, orf108), 10 tRNA genes, and 3 rRNA genes and 1 other gene (ccmC) are located on the N-strand, and the remaining 18 PCG genes, 16 tRNA genes, 3 rRNA genes and 3 other genes (mttB, ccmB, ccmC) are on the J-strand. There are two overlapping regions, with 73 bp overlapping between cox3 and sdh4 and 817 bp overlapping between rps4 and tRNA-Leu. There are certain intergenic sequences among adjacent genes in the remaining genes, indicating www.nature.com/scientificreports/ relatively low density of gene distribution along the genome. This is consistent with the patterns of other plants where non-coding regions are the important parts in consisting of mitochondrial genome 40-42 . Characteristics of nucleotide composition. The two genome circles slightly differ in nucleotide composition (SI Table 1). Genome 1 has a high content of the T base but a low content of the G base. Besides the AT or GC content, the AT-and GC-skews are often used to assess the nucleotide-compositional differences in mitochondrial genomes 45 . From SI Table 1, both AT-and GC-skews in genome 1 are negative (AT-skew = −0.0128 and GC-skew = −0.0241), indicating that genome 1 has a higher percentage of T and C than A and G, respectively. Both AT-and GC-skews are negative in PCG sequences (AT-skew = −0.0408 and GCskew = −0.0501). However, the AT-skew in tRNAs is positive (0.0430), indicating that these genes have a higher percentage of A than T. The GC-skew in tRNAs is negative (−0.1135), indicating that these genes have a higher percentage of G than C.
In genome 2, the AT-skew (−0.0029) is negative but the GC-skew (0.0058) is positive (SI Table 1), indicating that genome 2 has a higher percentage of T and G than A and C, respectively. The GC-skews in both PCGs (−0.0115) and rRNAs (−0.1242) are negative, but positive in tRNAs (0.0449). The AT-skews are negative in PCGs (−0.0569), tRNA (−0.0289) and rRNAs (−0.0864). The extents of both AT-and GC-skews are greater in rRNAs than in PCGs and tRNAs. Generally, the extents of AT-and GC-skews in both genomes 1 and 2 are small, comparable to the pattern in mitochondrial genomes of Pyrus pyrifolia (AT-skew = 0.004, GC-skew = 0) 46 but different from that of animal species Ledra auditura 44 (AT-skew = 0.22 and GC-skew = 0.12).
Protein-coding genes and codon usage. Codon usage bias is an important character of a genome since it is associated with gene expression 47,48 , the base composition of genes 49 , amino acid composition 50 , GC content 51 , the length of a gene 52 and tRNA richness 53,54 . Large differences in the codon usage of genes often occur among different species and organisms 52 .
The mitochondrial genome of N. cadamba harbors a total of 83 coding genes and 45,639 bp in length, accounting for about 11% of the entire mitochondrial genome. This density is greater than those of watermelon (Citrullus lanatus; 10.3% of 379,236 bp), zucchini (Cucurbita pepo; 3.9% of 982,833 bp) 55 and neem (Azadirachta indica A. Juss; 7.7% of 266,430 bp) 56 mitochondrial genomes. The base composition of the whole mtDNA of N. cadamba is 27.26% for A, 22.63% for C, 22.53% for G and 27.56% for T, exhibiting a AT-biased pattern, with the A + T content of 54.82%. The AT-biased pattern is frequently observed in both plant and animal mitochondrial genomes 57 .
The mitochondrial genomic protein-coding genes of N. cadamba are 37,521 bp in length, accounting for 83.03% of all coding genes. The 40 protein-coding genes encode a total of 12,507 codons. Figure 2 shows the Table 1. Annotations and characteristics of mitochondrial genome 1 of Neolamarckia cadamba. *J stands for the direction of a gene from 5′ to 3′, and N for the direction of a gene from 3′ to 5′. www.nature.com/scientificreports/ frequencies of different amino acids in the protein-coding genes where the amino acid Leu is most frequently used, followed by Ser, Ile and Gly. According to the RSCU values, codons are classified into optimal codons (RSCU > 1) and non-optimal codon (RSCU < 1). From Fig. 2 and SI Table 3, each amino acid has its preferred codon, with exception of amino acids Met (ATG) and Trp (TGG) that have only one codon and no preference.
A universal genetic code is used for all mitochondrial genes in angiosperms, and the third codon tends to be A or T 58 . A typical translation initiation codon is ATG, but alternative initiation codons occur in translation of rpl16 59 , mttB 52 , and matB genes. The initiation codon of the protein-coding genes in the mitochondrial genome of N. cadamba is ATG, except for rps10 and rpl16 where ACG is the initiation codon.
Transfer RNA and ribosomal RNA genes. There are 5 tRNA genes in genome 1, with a total length of 371 bp ( Table 1). The five tRNA genes range from 71 (tRNA-Arg) to 84 bp (tRNA-Tyr) in length, of which four genes are on the N-strand and one gene is on the J-strand. There are 26 tRNA genes in genome 2, with a total length of 1,909 bp (Table 1). These genes range from 60 bp (tRNA-Val) to 88 bp (tRNA-Ser) in length, of which ten genes are on the N-strand and sixteen genes are on the J-strand.
The secondary structure map of tRNA was predicted and generated using tRNAscan-SE 2.0 (http:// lowel ab. ucsc. edu/ tRNAs can-SE/) 60 and ARWEN (Version1.2, http:// mbio-serv2. mbioe kol. lu. se/ARWEN/) 61 . Structurally, tRNA-Ser (GCT), tRNA-Ser (TGA) and tRNA-Tyr (GTA) have a group of stem-loop structure on the extra loop between the TψC loop and the anti-codon stem, but the remaining tRNA genes are the typical clover-type secondary structure (SI Fig. 1).  www.nature.com/scientificreports/ In the secondary structure of tRNA, besides three classic base matches (A-T, G-C and G-T), there are also mismatches, such as G-A, A-C, T-T, T-C and A-A. The T-C and A-A mismatch pairs are only in the anti-codon stems. Three G-A pairs are in the amino acid acceptor stems, and the other three G-A mismatch pairs are in the TψC stems. Two A-C pairs are in the TψC stems, and the other five A-C pairs are in the amino acid acceptor stems. One T-T pair is in the amino acid acceptor stems, and the other four T-T pairs are in the anti-codon stems.
The mitochondrial genome of N. cadamba has 3 rRNA genes in total (rrn18, rrn5, and rrn26), ranging from 116 bp to 3,429 bp in length, and all rRNA genes are on the N-strand. Table 2 indicates that both genome 1 (~ 0.08%) and genome 2(~ 0.16%) have small proportions of repetitive sequences, with the repeat length of 579 bp in total. Most repetitive sequences (microsatellites) consist of single-and di-nucleotide repeats, with more numbers of (A) n and (T) n (14) than (G) n and (C) n (2), and more numbers of (AT) n and (TA) n (8) than others (2 (GA) n ). Three minisatellites are present in genome 2. All these repeats are not located in protein-coding regions except (T) 10 in orf309 of genome 1 and (A) 10 in Atp1 of genome 2. The small proportion of repeats implies that repetitive sequences do not play an important role in contributing to mitochondrial genome size of N. cadamba, different from the patterns of Nymphaea colorata 42 and other plants 40 . However, these repetitive sequences could be used to develop molecular markers for population genetic structure analysis in the future.
The rps3 gene sequence of 60 species of Rubiaceae was available from NCBI GenBank. Phylogenetic genetic relationships based on this single gene was constructed using the maximum likelihood method. SI Fig. 2 shows that N. cadamba is genetically close to Cephalanthus occidentalis and Hymenodictyon parvifolium. These three species together with Cubanola domingensis, Hillia triflora and Rondeletia odorata provide evidence that they belong to the Cinchonoideae subfamily although Deppea grandiflora (Ixoroideae subfamily) and Guettarda scabra (Rubioideae subfamily) were incompletely sorted in this clade. Using cpDNA segments (rbcL, rsp16 intron, nadhF, atpB-rbcL spacer) and nuclear ribosomal ITS, Rydin et al. 64 showed that five species (C. occidentalis, H. parvifolium, C. domingensis, H. triflora and R. odorata) belong to Cinchonoideae subfamily. The whole phylogenetic relationships indicate that large genetic divergence and incomplete linage sorting occurred among the three subfamilies of Rubiaceae in terms of the rps3 gene sequence.

Conclusions
In this study, we sequenced the mitochondrial genome of N. cadamba and successfully assembled the genome in two maps of circular molecule structure. Genome 1 has 109,836 bp and contains 14 genes. Genome 2 has 305,144 bp and contains 69 genes. The whole genome has slightly high AT content (54.82%). Genome 1 shows negative AT-and GC-skews, while genome 2 shows a negative AT-skew but a positive GC-skew. All proteincoding genes are initiated by the start codon ATG, except for a few genes initiated by alternative codons. The termination codes are TAA for most genes but TGA or TAG for a few genes. Each amino acid has its preferred codon except amino acids Met (ATG) and Trp (TGG) that have only one codon and no preference. The tRNA genes exhibit a typical clover-type secondary structure except tRNA-Ser (GCT), tRNA-Ser (TGA) and tRNA-Tyr (GTA) that have an extra loop between the TψC loop and the anti-codon stem. Tandem repeat sequences are www.nature.com/scientificreports/ minor, accounting for ~ 0.14% of the whole genome. Phylogenetic analysis with the DNA sequences of 24 PCGs confirms that N. cadamba belongs to order Gentianales. Analysis with a single gene rps3 of 60 species shows that N. cadamba is genetically closer to Cephalanthus accidentalis and Hymenodictyon parvifolium and belongs to the Cinchonoideae subfamily.  www.nature.com/scientificreports/ leaves in this study complies with institutional guidelines. Collection of the plant specimen was permitted by the University. Total genomic DNA was extracted from fresh leaves using CTAB method 65 . Then the quality of the extracted DNA samples was tested using (1)  Library construction and high-throughput sequencing. High-quality genomic DNA of 50 μg was used to generate a 40-kb SMRTbell library, with the size selection on the BluePippin (Sage Science, USA). The genomic DNA library was sequenced on the PacBio sequel platform (Pacific Biosciences, USA). SMRTbell DNA library preparation and sequencing were performed in accordance with the manufacturer's protocols (Pacific Biosciences, USA), and totally 2 Gb subreads were generated. In order to check the correction of PacBio assembly, an insert size of 500 bp pair-end genomic DNA library for Illumina Hiseq 4000 (Illumina, USA), was constructed by Science Corporation of Gene according to the standard protocol of Illumina. DNA library was constructed after quality control with Agilent 2100 Bioanalyzer (Agilent Technologies, USA). Four gigabytes DNA data were sequenced by Illumina Hiseq 4000 (Illumina, USA). Different sequencing methods were used in this study because lengths of PacBio sequencing reads were up to 40 kb, which was more suitable for complex genome assembly. However, the PacBio long reads potentially had much more sequencing errors, and the Illumina short reads were then used to fix the errors.
Comparative analysis of mitochondrial genomes. The use of mitochondrial codons had a preference, which would affect the expression of genes and reflect the evolutionary relationship of species to a certain extent. The calculation of relative synonymous codon usage was analyzed with a reference to the formula mentioned in Sharp and Li 74 . The relative synonymous codon usage (RSCU) was calculated as the ratio of the frequency of a focal codon to the mean frequency of all synonymous codons in a given protein-coding sequence. The usage bias of one synonymous codon is indicated when RSCU is not equal to 1; no usage bias is present when RSCU is equal to 1.
In most bacterial genomes, mitochondrial and plastid genomes, there are significant differences in base composition between heavy and light chains, which are called AT-skew and GC-skew. Calculations of the AT-and GC-skews are as follows 75 : where A%, T%, G% and C% represent the percentages of A, T, G and C in a given sequence, respectively. Phylogenetic analyses. MUSCLE v.3.8.31 (http:// www. drive5. com/ muscle/) software 76 was used to compare individual genes among multiple species, and then the genes of each species were aligned in a certain order. The protein-encoding gene sequence set of each species was generated by catenating 24 PCG sequences in the same gene order for further analysis. jModelTest2.1.7 (https:// code. google. com/p/ jmode ltest2/) was used to test the nucleic acid model of the selected sequence DNA 62 , and the best model has the minimum AIC (Akaike Information Criterion) value. Phylogeny tree was constructed with RAxML8.1.5 software (https:// sco.h-its. org/ exeli xis/ web/ softw are/ raxml/ index. html) 63 using the maximum likelihood (ML) method for both the catenated sequences of 23 species and the rps3 gene sequences of 60 species. The bootstrap value was set to be 1000 for each phylogenetic tree analysis.