Mitochondrial genome diversity and population structure of two western honey bee subspecies in the Republic of South Africa

Apis mellifera capensis Eschscholtz and A.m. scutellata Lepeletier are subspecies of western honey bees that are indigenous to the Republic of South Africa (RSA). Both subspecies have invasive potential and are organisms of concern for areas outside their native range, though they are important bees to beekeepers, agriculture, and the environment where they are native. The aim of the present study was to examine genetic differentiation among these subspecies and estimate their phylogenetic relationships using complete mitochondrial genomes sequences. We used 25 individuals that were either assigned to one of the subspecies or designated hybrids using morphometric analyses. Phylogenetic analyses of mitogenome sequences by maximum likelihood (ML) and Bayesian inference identified a monophyletic RSA clade, subdivided into two clades. A haplotype network was consistent with the phylogenetic trees. However, members of both subspecies occurred in both clades, indicating that A.m. capensis and A.m. scutellata are neither reciprocally monophyletic nor do they exhibit paraphyly with one subspecies nested within the other subspecies. Furthermore, no mitogenomic features were diagnostic to either subspecies. All bees analyzed from the RSA expressed a substantial level of haplotype diversity (most samples had unique haplotypes) but limited nucleotide diversity. The number of variable codons across protein-coding genes (PCGs) differed among loci, with CO3 exhibiting the most variation and ATP6 the least.

honey bees (including two published in 37,38 (Table 1). The assembled mitogenome sequences of all 25 honey bees were circular molecules that ranged in length from 16,338-16,513 bp ( Table 2). On average, we mapped over 70,000 Illumina HiSeq. 3000 paired-end reads (2 × 100 bp) per honey bee sample (range from 33,469-390,359), corresponding to a sequence depth that averaged 479× (range from 204-2,392×). The number of ambiguous sites in final assemblies was low (0.01%). These ambiguous sites were largely in non-coding regions (particularly the AT-rich region). Only one sequence had ambiguities in any of the protein-coding genes (PCGs) or rRNAs. Thus, our results demonstrate the effectiveness of genome skimming for the assembly of complete mitochondrial sequences.
Mitogenome content and organization was consistent with published data 37,38 with 38 regions: 13 protein-coding genes (PCGs), two ribosomal RNAs (lrRNA and srRNA), 22 tRNAs, and an AT-rich non-coding region. The mitochondrial gene orders and arrangements were identical to those of other published A. mellifera mitochondria [37][38][39][40] . The genes encoded on heavy and light strands and the initiation and termination codons for all individuals are identical to those from our previous publications 37, 38 . There were not cases where PCGs appeared to have an unusual start or stop codon; all PCGs initiated with ATA, ATC, ATG, or ATT and all ended with TAA or with a T that can be polyadenylated to form a TAA stop codon. All subspecies exhibited the same patterns of start and stop codon usage, such that the same start codon was found in all samples of all subspecies for a given gene. If the T was polyadenylated to form a stop codon, it was the same across all samples.
The nucleotide composition of the mitogenomes of A.m. capensis, A.m. scutellata and hybrids were strongly biased toward A and T with an average AT content of ~85% across the entire genome (Tables 3 and S1). Base composition did vary among regions of the mitochondrial genomes, with the highest AT content being found in the AT-rich region, as expected, and the lowest being found in COI (cytochrome oxidase I; Table 2). The number of variable codons differed among the PCGs (Table 2), with Cytochrome c oxidase subunit 3 (CO3) exhibiting the highest degree variation (over 4% of amino acids sites varied among samples) while ATP Synthase 6 (ATP6) showed much greater conservation (0.9%). Both NADH dehydrogenase 4 (ND4) and NADH dehydrogenase 5 (ND5) exhibited variation in the number of codons, with most variation observed in A.m. capensis (Table 2). There were no amino acid substitutions that presented synapomorphies uniting any of the subspecies for which multiple individuals were sampled. Instead, substitutions were either found in just a few individuals of a subspecies, were unique to one sample, or were shared among subspecies. The levels of variation differed among gene regions (Table 2), with the PCGs averaging just under 5% of variable sites. While the AT-rich region showed the greatest variation, some of this may be due to errors in alignment since the extreme base composition (over 95% AT) made identification of homology within this region challenging.
When comparing the RSA subspecies and hybrids, there was little variation and the subspecies did not appear distinct in overall characteristics of the mitogenome (Table 3). While the length of the mitogenome, the PCGs, the rRNAs and the AT-rich region varied, this variation occurred within subspecies and did not define subspecies (Table 3).  Table 1. Summary information for honey bee samples collected in the Republic of South Africa. Bees were sampled from 25 apiaries. Geographical region is identified by the main city/town closest to the sampled apiary and is abbreviated for use elsewhere in the manuscript. Subspecies identity was confirmed morphometrically following 9 . The GPS location of each apiary is noted (geographical coordinates). Finally, we include the GenBank accession numbers for the sequenced mitogenome of each bee.  Table 3. Genome size and nucleotide compositions among the mitogenomes of Apis mellifera capensis, A.m. scutellata and hybrid honey bees. * PCGs: protein-coding genes, includes stop codons. † L-rRNA: Large subunit of ribosomal RNA. ∫ S-rRNA: Small subunit of ribosomal RNA.

Figure 1.
Maximum likelihood phylogenetic tree constructed with RAxML approach for the 39 Apis mellifera mitogenomes using concatenated sequences (13 PCGs + two rRNAs). The number represents the bootstrap values which are shown behind each node. The uppermost clade shows the identity of each bee as S for A.m. scutellata, C for A.m. capensis, or H for hybrid. The geographic origin of each bee follows its identity (S, C, or H) and is reported as the abbreviation noted in Table 1.
The two datasets (complete mitogenome or 13 PCGs and two rRNAs) largely yielded similar results (e.g., compare Figs 1 to 2, 3 and 4), except that the placement of A.m. intermissa is in a different phylogenetic position when using the two data sets. Apis mellifera intermissa clusters with the RSA clade when complete mitogenomes are analyzed (Figs 3 and 4), whereas it is nested within a larger A.m. mellifera clade when using the concatenated   (Figs 1 and 2). Thus, the inclusion of the harder-to-align data does have some impact on phylogenetic reconstruction. The inclusion of more sites, especially more variable sites, is expected to result in higher support values. However, there is no consistent pattern with some relationships showing higher support with the complete mitogenome, and others with the concatenated dataset. This may reflect that some sites in the non-coding regions are misaligned, so the increased number of variable sites in complete mitogenome dataset may be offset by greater noise in this alignment. This emphasizes the importance of focusing on regions than can be aligned with confidence.
Using the concatenated data set, the maximum value of the uncorrected genetic distance (p-distance) for all pairs of taxa that included one of the 25 RSA mitogenomes was between A.m. ligustica (0.015). The minimum value (0, indicating identical sequences) was only found in nine pairwise comparisons among the 25 South African mitogenomes (data not shown). The overall mean value of pairwise p-distance across all 39 mitogenomes was 0.006. The Nei genetic distances within groups of A.m. capensis and A.m. scutellata were 0.002 and 0.001, with an overall distance 0.002.
Population comparisons and haplotype network. The level of haplotype diversity was extremely high for the subspecies for which we had two or more samples. Each honey bee contained a unique haplotype with the exception of samples from hybrid region (KL) and A.m. scutellata (S_NJ601784) which shared a single haplotype (Fig. 5, Table S2). Although haplotype diversity was high among all subspecies, nucleotide diversity was lower among the RSA subspecies (and hybrids) than among A.m. mellifera samples (Table 4). Similarly, the average number of nucleotide differences (K), is much higher for A.m. mellifera than for the RSA subspecies ( Table 4). The haplotype network of the concatenated data set generated 38 distinct haplotypes, supporting the results of the phylogeny that showed no population differentiation at the subspecies level (Fig. 5, Table S2).

Discussion
In this study, we compared whole mitogenomes from A.m. capensis, A.m. scutellata and hybrids in the RSA, along with published mitogenomes from three other subspecies to infer phylogenetic relationships and genetic differentiation among them. Our phylogenetic analyses suggested that the RSA honey bees are genetically closer to one another than to other taxa included in the tree. Although A.m. capensis and A.m. scutellata are genetically indistinguishable when using mitochondrial genomes, morphological analyses do separate these into two distinct clusters 9 . Furthermore, A.m. capensis and A.m. scutellata have distinctive physiological and behavioral differences 8,9,11 . Thus, the two subspecies are quite distinct and should not be considered as a single subspecies, even if currently no genetic diagnostic markers have been identified for either subspecies.
Although mitogenomes show significant rearrangement of gene orders among Insecta a putative ancestral gene order has been inferred, based in part on studies of Hemiptera 41,42 . There have been rearrangements on the branch leading to Apis 42 but arrangement and gene orders observed in the examined mitogenomes for honey bees from the RSA were identical to those reported for other honey bees, supporting conservation of mitogenome gene order within this genus 40,43,44 . It is possible that there will not be major rearrangements within Apis but it is likely that A. mellifera needs to be studied more broadly across the complete modern diversity of Apis 37,38 to establish this with confidence.
The consensus mitogenomes obtained in this study varied slightly in size: ±170 bp in A.m. capensis; ±141 bp in A.m. scutellata; and ±115 bp for the hybrids. Variation in mitogenome size is typically the consequence of variation in the non-coding regions in insects 45 , which is consistent with our observations of A. mellifera mitogenomes. Only limited size variation was detected in the coding and rRNA regions. The mitogenomes of A. mellifera has an AT-bias (84.9%), with guanine as the rarest nucleotide 43 , which is consistent with other observations that the mitochondrial genomes of insects in general have very strong A+T base composition in the non-coding control region 46,47 .
We observed different levels of variation in the PCGs genes across all datasets. The greatest variation looking at nucleotides was in ND4L (followed by ND4), though at the amino acid level the greatest variation was in the CO3 gene. For both metrics, ATP6 exhibited the highest level of conservation. De Jager et al. 45 found that the greatest variation in the Diuraphis noxia (Hemiptera: Aphididae) was observed in the ND5 nucleotide gene but the causal effects of variation in ND5 was not fully understood. These results suggest that the degree of stabilizing selection differs on different genes within the mitochondria, and that this likely also differs among different taxonomic groups. However, too little information currently exists on insect mitochondria to elucidate why there may be different levels of selection among genes, and how universally there are differences among different taxonomic groups.
Using the complete mitochondrial genome sequences, we found that A.m. intermissa clustered at an intermediate position (Figs 3 and 4). The phylogenetic position of A.m. intermissa has been reported by previous investigators and supported by the present study, indicating that A.m. intermissa might be a hybrid subspecies or that the sampled bee was misidentified as a true representative of the subspecies 48 . We support sequencing of additional samples to confirm the phylogenetic position of this subspecies.
Mitochondrial DNA has been widely used as a marker in population genetic, biogeographic, phylogenetic and DNA barcoding studies 49 . Due to their relative rapid coalescence, one should be able to distinguish reciprocally monophyletic groups even when analyses of nuclear markers fail to do so. However, based on our results, mitogenomes appear to be poor diagnostic phylogenetic markers in Apis, as has also been found in some other taxa 50 . Our results are not likely due to specific mitochondrial inheritance patterns since similar results have been reported for these subspecies using nuclear SNP loci 27,28 . The absence of genetic differentiation, even with mitochondrial DNA sequences, in taxa where there are clear phenotypic differences among populations has been found in a variety of invertebrate and vertebrate taxa [51][52][53] , and may be due to the retention of ancestral genetic variation in recently diverged lineages.
An alternative explanation for the lack of genetic differentiation between A.m. capensis and A.m. scutellata could also be due to the impact of beekeeping activities, which can involve the exchange of queens and colonies. These are known to contribute to the admixture pattern among honey bee populations 8,54 . This, in fact, led to a major problem in RSA known as the "capensis calamity." This occurred because A.m. capensis workers were moved from southern portions of RSA into northern portions outside their native range 55 . The A.m. capensis workers became social parasites of A.m. scutellata colonies there, which is a possible reason that their mitogenomes are represented in bees that were morphometrically identified as A.m. scutellata 56 . This problem has slowed or nearly eliminated the movement of managed hives between southern and northern RSA, so the problem should be diminishing. That said, in spite of the movement of bees, the morphological and physiological differences between A.m. capensis and A.m. scutellata persist. The high genetic diversity in A.m. capensis, A.m. scutellata, hybrids, and other subspecies could be due to large population sizes within their natural habitats 9 . The observed low nucleotide diversity with high haplotype diversity observed in both RSA subspecies might suggest a genetic bottleneck occurred in the past, followed by rapid population expansion for both subspecies. This hypothesis could be investigated via the sampling of more sites and more individuals throughout RSA.
While SNP analyses can differentiate other honey bee subspecies belonging to different evolutionary groups 29,57,58 , mitogenome sequences are not able to differentiate A.m. capensis and A.m. scutellata. There are several reasons this could occur. First, while some hybrids can be identified morphologically, there may be others (particularly if hybrids backcross to one of the parentals) that are morphologically like one subspecies, but may retain the mitogenome of the second subspecies. Second, there could be incomplete lineage sorting, resulting in a mitochondrial gene tree that does not match the subspecies tree 59 . Finally, a small number of genes may be responsible for the phenotypic differences among subspecies, while the remainder of the nuclear genome (as well as the mitogenome) have not yet become reciprocally monophyletic 60 . If this is the case, then the small number of genes responsible for this differentiation would be expected to provide diagnostic markers. Exploration of complete or mostly complete nuclear genomes may be necessary to understand patterns of genetic differentiation, gene flow and divergence among A.m. capensis and A.m. scutellata. Such analyses may eventually identify diagnostic markers, thereby allowing more rapid identification of invasive species. Such methods would enhance our ability to detect hidden genetic structure among regional populations of honey bees in RSA.

Methods
Honey bee samples, DNA extraction and sequencing. Honey bee samples were collected from managed colonies located across 25 geographical regions in RSA (Table 1, Fig. 6) in April/May 2013 and April 2014 61 . During the field collections, 50-70 adult worker bees were collected from the brood nest of 1-10 colonies located in 1-3 apiaries sampled per each of the 25 geographic regions. In total, 1,000+ colonies were sampled. All bees were collected into 50 ml tubes containing ≥ 98% ethanol. After collection, the samples were imported into the U.S. per USDA APHIS regulations and approval and stored at −80 °C. Bustamante et al. 62 conducted a morphometric analysis on 10 bees from a subset of sampled colonies (N = 240) to classify colonies as A.m. capensis, A.m. scutellata, or hybrids according to standard procedures 9,63 . The thorax of each bee that was identified morphometrically was preserved in 95% ethanol and stored at −80 °C until molecular processing.
One bee from each geographic region was selected for use in the present study. In total, 17 A.m. capensis, six A.m. scutellata, and two hybrid bees were used (Table 1 and Fig. 6), including two whose complete mitogenomes had been published previously 37,38 . Total genomic DNA was extracted from the thoracic tissue of each bee using a Wizard ® Genomic DNA Purification kit (Promega, USA) according to the manufacturer's instructions.
The quality of DNA was checked using a 1% agarose gel and quantified using a Quant-iT PicoGreen dsDNA Assay Kit (Life Technologies). From this, 500 ng of DNA was utilized for the construction of paired-end libraries  comparable with Illumina sequencing by RAPiD Genomics (Gainesville, Florida, USA). The constructed libraries were multiplexed and sequenced with PE-100 cycle runs (2 × 100 bp) using Next Generation Sequencing on an Illumina HiSeq. 3000 platform (San Diego, California, USA).
Sequence quality control, assembly and annotation. Mitochondrial data were obtained using the genome skimming method 64 . Short reads were filtered by quality in two steps using the fastx-toolkit (http://hannonlab.cshl.edu/fastx_toolkit/). The first removed bases from the 3′ end of the sequence that had a phred score below 20 and removed the entire read if the resulting length was <50 nt (fastq_quality_trimmer -Q 33 -t 20 -l 50). In the second step, reads that passed the first criteria, but that had ≥10% of the bases with a phred score <20 (fastq_quality_filter -Q 33 -q 20 -p 90) were removed. The resulting sequences were aligned to the bee mitochondria reference genome (NCBI accession NC_0015661) using Mosaik Aligner 65 version 2.1.33 (parameters -mmp 0.05, -ls 500 -m all -a all).
Illumina sequence reads were mapped to A.m. capensis and A.m. scutellata (KX870183 and KJ601784.1) mitogenomes using Geneious R9.1 66 , and the assembly that exhibited the fewest conflicts (e.g., that mapped to A.m. capensis or to A.m. scutellata) was used. The mapping procedures are described in 37 .
The locations and orientation of protein-coding genes (PCGs), transfer RNA genes (tRNAs) and ribosomal RNA genes (rRNA) genes were identified by multiple alignments to the reference mitogenome using Geneious R9.1 66 . All PCGs for each honey bee mitogenome were translated into amino acids and were manually checked to ensure that each could encode a functional protein (i.g., we examined putative PCGs to determine whether they lacked start or stop codons or exhibited frame shifts and/or premature stop codons. Sequence alignment. We aligned 23 novel complete mitogenome sequences with two additional mitogenomes that we published previously 37,38 , as well as 14 published mitogenomes primarily from other A. mellifera subspecies (Table S3). All mitogenomes were aligned using Muscle 3.8.31 67 implemented in Mesquite 3.04 68 with default parameters. The alignment was manually adjusted to maintain reading frame integrity in the protein coding genes.
Our alignment included the complete mitogenome nucleotide sequence. However, since homology was harder to assign in some regions (particularly intergenic regions and the AT-rich region), we also analyzed a concatenated alignment that included all 13 PCGs and two rRNAs to represent a dataset in which we had higher confidence in the alignment. This allowed us to determine if our conclusions depended upon alignment or were robust to inclusion of specific regions and homology assessment in non-coding regions. We estimated p-distance between taxa using MEGA7.
Phylogenetic analyses. Phylogenetic analyses of the complete (entire mitogenome) and concatenated datasets (13 PCGs and two rRNAs) were performed using two different tree-building methods, maximum likelihood (ML) and Bayesian Inference (BI). Maximum Likelihood analyses were performed using RAxML version 8.0.2 69 with the GTRGAMMA model and 1,000 rapid bootstraps to assess nodal support. BI was performed based on the concatenated data set using MrBayes version 3.2.4 70 . Before running MrBayes, the complete and concatenated datasets were run through jModelTest 3.7 71 and the AIC was used to select the best model. The best model for the complete dataset was GTR+I+G (shape = 0.7050, pinvar = 0.81), and the best model for the 13 PCGs and two rRNAs was TVM+I+G (shape = 0.8920, pinvar = 0.834). The BI used two independent runs with 8,000,000 generations in each and four chains. Each chain was sampled every 1,000 generations with a burn-in of 25%. The remaining trees were condensed using a 50% majority-rule consensus tree with posterior probabilities.
We also assessed the evolutionary distance between A.m. capensis and A.m. scutellata using uncorrected p-distances among unique haplotypes of the 13 PCG and two rRNA data set with MEGA7 72 .
Population genetic analyses. Our initial results suggested there could be some bias due to mis-alignment in the analysis of the complete dataset. Therefore, we focused population genetic analyses on the 13 PCGs and two rRNA dataset. A haplotype network was generated using the median-joining method with default settings in NETWORK v. 4.6.1.3 73 for the concatenated dataset (13 PCGs and two rRNAs) from the 39 samples (25 sequenced from the RSA, and 14 previously published mitogenomes from Africa and Europe). All networks were visualized and manually adjusted to ensure sufficient resolution. We computed the haplotype diversity (HD), nucleotide diversity (π), number of haplotypes (H), number of variable sites (V), number of mutations (M), average number of nucleotide differences (K) and neutrality indices using DnaSP v5 74 .