Complete mitochondrial genome of Bactrocera arecae (Insecta: Tephritidae) by next-generation sequencing and molecular phylogeny of Dacini tribe

The whole mitochondrial genome of the pest fruit fly Bactrocera arecae was obtained from next-generation sequencing of genomic DNA. It had a total length of 15,900 bp, consisting of 13 protein-coding genes, 2 rRNA genes, 22 tRNA genes and a non-coding region (A + T-rich control region). The control region (952 bp) was flanked by rrnS and trnI genes. The start codons included 6 ATG, 3 ATT and 1 each of ATA, ATC, GTG and TCG. Eight TAA, two TAG, one incomplete TA and two incomplete T stop codons were represented in the protein-coding genes. The cloverleaf structure for trnS1 lacked the D-loop, and that of trnN and trnF lacked the TΨC-loop. Molecular phylogeny based on 13 protein-coding genes was concordant with 37 mitochondrial genes, with B. arecae having closest genetic affinity to B. tryoni. The subgenus Bactrocera of Dacini tribe and the Dacinae subfamily (Dacini and Ceratitidini tribes) were monophyletic. The whole mitogenome of B. arecae will serve as a useful dataset for studying the genetics, systematics and phylogenetic relationships of the many species of Bactrocera genus in particular, and tephritid fruit flies in general.

Scientific RepoRts | 5:15155 | DOi: 10.1038/srep15155 based on cox1 gene sequences 6 , and Virgilio et al.'s study of 56 Bactrocera taxa using cox1 and rrnL gene fragments 7 . To date there are only 7 nucleotide sequences for B. arecae in GenBank -2 for cox1 and one each for ITS-1, cox2, nad1, rrnL and rrnS. At the mitochondrial genome (mitogenome) level, only 10 whole genomes of Bactrocera taxa are available in GenBank.
We report here the whole mitogenome of B. arecae determined using next-generation sequencing (NGS) and discuss the molecular phylogeny of Dacini tribe.

Results
Mitogenome analysis and features. Next-generation sequencing on NextSeq 500 Dekstop Sequencer generated an approximately 100 giga bases data from B. arecae library. Removal of low quality sequence (< Q20), PhiX reads and sequences shorter than 50 nucleotides resulted in 459 million pairedend reads with 57 billion bases. A total of 2,451,320 sequence reads with a total of 330,821,510 bases were mapped to full mitochondrial genome reference sequences of Bactrocera genus. De novo assembly of these mapped reads resulted in 164 contigs with maximum length of 15,925 bp and N50 of 485. The total GC content was 28.4%, with base composition of 33.8% A, 37.9% T, 17% G, and 11.4% C.
The commonest start codon was ATG (in 6 PCGs -cox2, atp6, cox3, nad4l, nad4, cob), followed by three for ATT (nad2, nad3, nad6), and one each for ATA (nad1), ATC (nad5), GTG (atp8) and TCG (cox1). Eight PCGs had TAA stop codon, two had TAG while the remaining three genes had incomplete stop codons (TA in cox1; T in nad1 and nad3) ( Table 1). Table 2 summarizes the base composition of the mitochondrial whole genome, protein-coding genes, rRNA genes and control region. All were A + T rich. The A + T content for PCGs ranged from 63.3% (cox1) to 78.8% (nad4l). Eight PCGs (atp8 and all 7 nad) had A + T content of over 70%. The A + T content of the non-coding control region was 86.0%. The GC skewness values for the whole genome, PCGs, rRNA genes and control region were negative (− 0.120 to − 0.487) indicating bias toward the use of Cs over Gs. Although the AT skewness value was positive (0.080) for the whole genome, it was variable for individual genes.
Of the tRNAs, the cloverleaf structure for trnS1 lacked the D-loop, while trnN and trnF lacked the TΨ C-loop (Fig. 2). The number of base pairs in the DHU-stem ranged from 3 to 4 ( Fig. 2; Supplementary  Table S1). All the TΨ C-stems had 5 base pairs except 4 bp in trnC, trnH, trnP and trnS1. The number of bases in the D-loop and TΨ C-loop was variable.
Phylogenetic relationships within Dacini tribe. Figure  The subfamily Dacinae was also monophyletic and clearly separated from Tephritinae subfamily. However, based on rrnL and rrnS genes the subfamily Dacinae was not monophyletic as Procecidochares utilis (Tephritinae subfamily) was sister to the Dacini (Fig. 4).

Discussion
Mitochondrial genomes of insects have been very extensively studied 8 . They have been applied particularly to studies regarding phylogeny and evolution 8 . To date there are 12 complete mitogenomes of tephritid fruit flies in GenBank -10 from Bactrocera (Dacinae, Dacini), and 1 each from Ceratitis (Dacinae, Ceratitidini) and Procecidochares (Tephritinae, Cecidocharini).
The mitogenome size of B. arecae (15, In B. arecae mitogenome, in addition to 8 TAA, 2 TAG, and 3 incomplete (1 TA and 2 T) stop codons are represented in the protein-coding genes ( Table 1). This differs from some tephritid mitogenomes which possess additionally truncated TA stop codon (e.g. B. dorsalis, B. oleae) 9,10 or TAT stop codon (B. minax) 11 . Among the tephritid fruit flies, over half of the PCGs in B. dorsalis have truncated stop codons (3 TA and 4 T) 9 . The incomplete T-stop codons can be converted to TAA by post-translational polyadenylation 12 . Additionally, the nad5 gene of B. arecae has ATC instead of ATT start codon found in congeners (B. dorsalis, B. minax, B. oleae).
As in other insects, the B. arecae mitogenome has three main clusters of characteristic tRNAs ( Fig. 1): (1) I-Q-M (isoleucine, glutamate and methionine); (2) W-C-Y (tryptophan, cysteine and tyrosine); and (3) A-R-N-S1-E-F (alanine, arginine, asparagine, serine S1, glutamate and phenylalanine). The atypical cloverleaf structure of trnS1 is common in all Metazoa 13 . Our finding of B. arecae forming a sister group with B. tryoni and B. minax with B. oleae based on 13 PCGs and 37 mt-genes (Fig. 3, Supplementary Fig. S2) concurred with the phylogeny based on cox1 sequences 6 . However, based on 13 PCGs and 37 mt-genes the sister group of B. arecae plus B. tryoni was sister to the B. dorsalis lineage. B. correcta was sister to the B. dorsalis lineage based on one study of cox1 sequences 6 . Other studies based of cox1 14 and multi-gene sudies 5 also indicated closer affinity of B. arecae and/or B. tryoni to the B. dorsalis complex than of B. correcta to B. dorsalis. However, the analysis based on concatenated cox1 and rrnL sequences indicated B. tryoni to be closer to B. dorsalis complex while B. arecae and B. correcta belonged to different lineages 7 .
In summary, we have successfully sequenced the whole mitochondrial genome of B. arecae by next-generation sequencing. The genome features are similar to other tephritid fruit flies except the ATC start codon for nad5 gene and the absence of TΨ C-loop in trnN and trnF tRNAs. The phylogenetic tree based on 13 PCGs is concordant with that based on 37 mt-genes. B. arecae shows closest genetic affinity to B. tryoni. Sequence and genome analysis. Raw sequences were extracted from the Illumina NextSeq 500 system in FASTQ format and the quality of sequences was evaluated using the FastQC software 22 . All the ambiguous nucleotides and reads with an average quality value (lower than Q20) were excluded from further analysis. The trimmed sequences were mapped against three reference mitogenomes, namely, Bactrocera dorsalis (NC_008748), Bactrocera carambolae (NC_009772) and Bactrocera dorsalis (= papayae) (NC_009770) using the CLC Genomic Workbench v.7.0.4 (Qiagen, Germany). The mapped sequences were then subjected to de novo assembly. High coverage contigs (average coverage value more than 500) greater than 15 kbp were subjected to BLAST 23 alignment against the nucleotide database at National Center for Biotechnology Information (NCBI). Contigs with hits to mitochondrial genes or genomes were identified and extracted from CLC Genomic Workbench.

Mitogenome identification and annotation.
A contig identified as mitogenome was manually examined for repeats at the beginning and end of the sequence to establish a circular mtDNA. It was then annotated with MITOS 24 followed by manual validation of the coding regions using the NCBI ORF Finder (http://www.ncbi.nlm.nih.gov/gorf/gorf.html). The sequin file generated from MITOS was edited and submitted to NCBI according to ORF Finder result (NCBI GenBank accession number KR233259).

Mitogenome visualization. The circular mitogenome of B. arecae was visualized with Blast Ring
Image Generator (BRIG) 25 .
Phylogenetic analysis. Mitogenome sequences (12 taxa) of Tephritidae family that were available in GenBank were used with B. arecae to reconstruct phylogenetic trees (Figs 3 and 4; Supplementary Fig. S1). Drosophila incompta NC_025936, D. melanogaster NC_024511 and D. yakuba NC_001322 of the Drosophilidae family were included as outgroups.
Nucleotide sequences of the 13 protein-coding genes (PCGs) were separately aligned using ClustalX v.1.81 26 program and subsequently edited and trimmed using BioEdit v.7.0.5.3 27 . The sequences of rrnS, rrnL and 22 mt-tRNA genes were aligned by MAFFT v.7 28 . The best-fit nucleotide substitution models Scientific RepoRts | 5:15155 | DOi: 10.1038/srep15155 for maximum likelihood (ML) using the corrected Akaike Information Criterion 29 and Bayesian (BI) analyses using the Bayesian Information Criterion 30 were determined by Kakusan v.3 31 . Phylograms of 13 concatenated PCGs, 37 mt-genes and 2 rRNA genes were constructed using TreeFinder 32 . Bootstrap values (BP) were generated via 1,000 ML bootstrap replicates. Bayesian analyses were conducted using the Markov chain Monte Carlo (MCMC) method via Mr. Bayes v.3.1.2 33 , with two independent runs of 2 × 10 6 generations with four chains, and with trees sampled every 200 th generation. Likelihood values for all post-analysis trees and parameters were evaluated for convergence and burn-in using the "sump" command in MrBayes and the computer program Tracer v.1.5 (http://tree.bio.ed.ac.uk/software/tracer/). The first 200 trees from each run were discarded as burn-in (where the likelihood values were stabilized prior to the burn-in), and the remaining trees were used for the construction of a 50% majority-rule consensus tree. Phylogenetic trees were viewed and edited by FigTree v.1.4 34-36