Introduction

Confidence in species identification provides insights into the evolutionary forces that drive species diversification, and it underpins the strategies for biological conservation and the management of natural resources. While morphological differences have enabled many species to be identified, cryptic species complexes present a particular challenge due to their general lack of readily distinguishable morphological characters1,2,3. Ecological and life history data including host plant use and mating behaviour are needed to mitigate and address these challenges3,4,5. This is often time-consuming and impractical especially with large numbers of samples across wide geographic range. Incorporating DNA markers, e.g., diagnostic mitochondrial DNA (mtDNA) markers6,7,8, genome-wide sequencing9,10, and whole-genome resequencing11 can increase confidence in differentiating cryptic species in a complex and identifying interspecific hybrids.

The challenges can become critical when individual cryptic species in a complex have a serious impact on plant health or agricultural productivity: insect vectors of plant viral diseases affecting cassava (Manihot esculenta), the most important food crop grown in Africa12, is one example of such case. Cassava production is vulnerable to cassava mosaic disease (CMD) caused by a number of cassava mosaic geminiviruses (CMGs)13,14 and CMD has been reported in Africa since the early 1970s15 and more recently in South-East Asia16,17. While the vector responsible for the spread of the CMD in South-East Asia is not well-understood, the pandemic in east and central Africa was associated with the distribution of sub-Saharan African (SSA) species of the Bemisia whitefly complex13. Invasive whitefly species such as the Mediterranean (MED), Middle East-Asia Minor 1 (MEAM1), and SSA2, have become pests of global agriculture, spreading plant viral diseases and becoming resistant to chemical control agents18,19. In particular, the endemic African cassava Bemisia whitefly SSA2 species occupies regions affected by the CMD pandemic. SSSA2 has also expanded into southern Europe since 200720 whereas SSA1 has not (though widespread throughout the African continent). Defining the actual CMD vectors in Africa remains a challenge due to the cryptic nature of the many species in the Aleyrodidae whitefly genus Bemisia3,18,21,22.

Historically, the lack of distinguishable morphological characters among Bemisia tabaci has resulted in geographically distinct species being reclassified as a single species23,24,25 and underpinned the B. tabaci taxonomic confusion (e.g., review by De Barro et al.18; see also Tay et al.26,27). Currently, over 38 cryptic species are recognised within the B. tabaci (Gennadius) species complex24 through molecular diagnostics of a partial (i.e., 657 bp) mitochondrial DNA cytochrome oxidase subunit I (mtCOI)18,28,29, mating experiments5,30,31 (although low numbers of F1 hybrids10,32 including between SSA1 and SSA2 (previously ‘Ug1’ and ‘Ug2’)33,34 are known), behavioural4, and host preference3 studies.

Re-analyses of partial mtCOI sequences relating to the MED species3, and of the B. tabaci standard partial mtCOI dataset33 demonstrated that ca. 60–65% were nuclear mitochondrial DNA sequences (NUMTs) or other PCR artefacts10,29,35. These analyses, together with molecular characterisation of original voucher specimens, have begun to disentangle the species complex26,27, identified historical species ranges26,27,36, and provided phylogenetic evidence that the ‘B. tabaci cryptic species complex’ is a polyphyletic group derived from various SSA Bemisia whitefly species’29. Delimitation of populations across the full African Bemisia species habitat range will be needed. This would help to ascertain whether any of these species, e.g., SSA1, is genetically distinct with limited gene flow from others either due to external (e.g., geographic or climatic limitation) or internal (e.g., presence/absence of secondary symbionts) factors including host association.. Recognition of such taxonomic classifications and their distribution in sub-Saharan Africa will be important for management of these pest species that could pose a threat to plant health and global agricultural biosecurity37.

The advent of high-throughput sequencing (HTS) technologies has driven the availability of genome-wide data that is further transforming the Bemisia taxonomy. HTS has demonstrated that various species previously recognised via characterisation of mtCOI (e.g., SSA429; Middle East Asia Minor 2 (MEAM2)10,29,35) were likely artefacts caused by the sub-optimal primers16,17 and sequencing of NUMTs10,35. Analyses of mitochondrial genomes also demonstrated inconsistencies of B. tabaci cryptic species identification based on the standard partial mtCOI gene region, and cautioned the need to combine behavioural and molecular evidence to delimitate species status3,31.

Despite the recent advances and their importance in the African agricultural landscape, the taxonomy, population structure, and evolutionary genetics of the African cassava Bemisia whitefly species complex (i.e., SSA1, SSA2, SSA3, SSA6, and SSA9)29 remain poorly understood. A recent study of taxonomic structure and gene flow among genetically diverse SSA1, SSA2, SSA3, and SSA4 species38 described serious discrepancies between their results obtained from mtCOI and from nextRAD sequencing. This (the conflicting results of mtCOI and nextRAD sequencing) findings has strongly implies that the species delimitation of sub-Sahara African Bemisia species needs to be revisited since it is currently based on mtCOI. We seek to show why mtCO and nuclear DNA should result in such conflicting implications for the African cassava Bemisia whitefly SSA1 (which is also further refined into subgroups (SG) 1, 2, and 3 in, e.g., Legg et al.13), SSA2, and SSA4 species. With the cassava Bemisia whitefly being the most common whitefly on cassava crops in Africa, and capable of vectoring diseases (e.g., cassava mosaic begamoviruses, CMBs; cassava brown streak ipomoviruses, CBSIs) known to cause significant crop losses39, our findings will also provide insights into incipient speciation of this agriculturally important Bemisia species complex with the capacity to impact on food security in sub-Saharan Africa40,41.

Results

Wosula et al.38 reported a total of 25 partial mtCOI sequences from their 72 SSA samples studied. Of these 25 sequences, only 11 represented unique (non-redundant) mtCOI haplotypes [seven SSA1: MF417579, MF417581, MF417585, MF417587, MF417590, MF417592, MF417601; one SSA3: MF417591; one SSA2: MF417597; two ‘putative SSA4’ MF417589, MF417593 (with ca. 94% nucleotide identity to SSA3)].

Integrity of African cassava Bemisia SSA mtCOI sequences

Berry et al.42 reported 11 ‘Sub-Saharan III: Western Africa-Cameroon/Cassava’ sequences which were renamed as SSA418. Of these 11 SSA4 sequences, only two short sequences (i.e., Cam Ayos 1 WO2, Cam Ayos 2 WO3) did not have INDELs or exhibit any premature stop codons at the amino acid level (Suppl. Fig. 1). The remaining nine sequences had random INDELs that resulted in frame-shift mutations and premature stop codons, which suggested that they are likely to be NUMTs. The Cam Ayos 2 WO3 sequence also had nucleotide substitutions that resulted in amino acid residue changes in highly conserved mtCOI regions, suggesting that this was likely also a NUMT sequence. The remaining single, short but ‘functional’ Cam Ayos 1 WO2 sequence (AF344246) that phylogenetically grouped42 with the 10 NUMT-related SSA4 sequences suggested that as a whole, these SSA4 sequences originally reported by Berry et al.42 were all likely NUMTs. The fact that SSA4 sequences were phylog enetically clustered to SSA2 suggested they were also potentially of nuclear origin, similar to that reported for B. tabaci MEAM2 NUMT10,35 that originated from the nuclear genome of MEAM110,35.

Further evidence that the partial mtCOI gene lacks the power to differentiate between evolutionary very closely related cryptic species (e.g., the ‘subgroups’ (SG)13,43 taxonomic status within SSA1) can be seen by examining nucleotide substitution patterns. For example, re-assessment of the two SSA1-SG5 sequences (MF417585, MF417586) showed that they differed from the SSA1-SG3 sequence JQ286482 by only three C/T substitutions that resulted in 0.56% nucleotide difference (Suppl. Fig. 2A). The basis for suggesting SSA1-SG5 and SSA1-SG3 to be different ‘subgroups’ within the SSA1 species was therefore unclear. For example, sequences with similar (Suppl. Fig. 2B) or with greater (Suppl. Fig. 2C) numbers of nucleotide substitutions were placed within same subgroups, or potentially as the same species31 (Suppl. Fig. 2D).

Species delimitation and introgression

The genome-wide SNP-based Principal Component Analysis (PCA) discriminated between SSA1 and SSA2 sampled from eight sub-Saharan African nations (Fig. 1A), suggesting that SSA1 and SSA2 are genetically distinct and likely to be different species as supported by mating studies31,34. Samples from SSA4 clustered with SSA2 (Fig. 1B,C), suggesting that mtCOI-based SSA4 species identification was likely misidentification of SSA2 due to NUMTs29. Removal of NUMTs resolved the mtCOI discrepancies, and reanalyses of genome-wide SNP data provided insights into SSA1-species-wide population substructure, identified likely geographic contact zones, and signatures of infra- and inter-species hybridization.

Figure 1
figure 1

Sampling sites and principal component analysis of SSA Bemisia samples. Principal component analysis (PCA) was based on 14,358 genome-wide SNPs across the 62 sub-Saharan African cassava Bemisia samples from eight sub-Saharan countries. (A) African countries where whitefly individuals were sampled from. Country names and colours as (B). African map created using mapchart <mapchart.net>. (B) Country of origins for these SSA1 and SSA2 Bemisia species are shown by unique colour codes. Parentheses after country names are the ‘genetic groups’ as proposed by Wosula et al.38. (C) PCA showing clustering of individuals by their SSA1 or SSA2 species status from this study’s re-analyses of genome-wide SNPs.

Maximum-Likelihood (ML) phylogeny inferred using RAxML based on 14,358 genome-wide SNPs (Fig. 2A) also supported the PCA result and broadly divided these 62 SSA individuals into three clusters that consisted of: (i) SSA2 (i.e., individuals previously identified as SSA4 or SSA2 by mtCOI; 100% node confidence), (ii) SSA1-SE (i.e., ‘SSA-CA/SSA-ESA’ populations; node confidence 89%), and (iii) SSA1-NW (i.e., ‘SSA-ECA/SSA-WA’ populations; node confidence 96%). The phylogenomic analysis supported the SSA1 and SSA2 species status as consistent with results from partial mtCOI gene sequences (Fig. 2B,C). Two clades within SSA1, denoted as SSA1-SE (South-Eastern) and SSA1-NW (North-Western) (Fig. 2C), respectively, was evident, reflecting the geographic distribution of the populations identified within them (SSA1-SE individuals from DR Congo, Tanzania, Madagascar; SSA1-NW samples from Cameroon, Nigeria, DR Congo, Rwanda, Burundi, Tanzania). Within SSA1-NW, the ‘SSA-ECA’ and ‘SSA-WA’ populations38 are clustered separately with 99% and 96% confidence values. Within SSA1-SE, the ‘SSA-ESA’ and ‘SSA-CA’ populations38 are partially resolved, sharing only 34% confidence and suggesting insufficient genomic evidence to support their separate genetic group classification38.

Figure 2
figure 2

Phylogeny of SSA populations. (A) RAxML phylogeny with 1000 bootstrap replications based on 14,358 nextRAD genome-wide SNPs from 62 SSA Bemisia individuals and three ‘non-cassava individuals’38 as outgroup. Branch colour schemes (red, green, blue) are as used in Admixture analysis (see Fig. 3). Sample names and branch tip colour schemes (filled circles) followed Wosula et al.38, and where available, mtCOI GenBank accession numbers corresponding to selected samples, and their proposed mtCOI SSA1-subgroups (SG)13 status (i.e., SG1, SG2, SG3, SG544) are also listed. Individuals that have partial mtCOI gene sequence but have been excluded from the nextRAD genome-wide SNPs phylogeny analysis due to poor SNP data coverage are boxed. The Bemisia SSA2-SSA4 clade is indicated by green branches, and the SSA lineages as proposed by Wosula et al.38 [i.e., CA (pink), ESA (black), WA (green), ECA (red)] are here identified as belonging to the SSA1 species and together, made up the ‘SSA1-SE’ and ‘SSA1-NW’ putative sub-species. Note that only nodes with ≥ 75% bootstrap support are shown. Enlarged section of branch breaks separating SSA1-NW, SSA1-SE, and SSA2 are shown by the insert box. The long branch length (0.43) separating the outgroups and the SSA1/SSA2 clades has been shorten as indicated by ‘//’. (B) Radial phylogeny with sample names and lineage colours as for (A). (C) Detail of the RAxML phylogeny in (B) showing only the SSA1 (i.e., represented by all ‘SE’ and ‘NW’ individuals) and SSA2 (represented by SSA2 and ‘SSA4’ individuals) clades. Proposed genetic groups for the SSA1 species by Wosula et al.38 are also indicated (i.e., ‘CA’ and ‘ESA’, and ‘WA’ and ‘ECA’ within the putative SSA1-SE and SSA1-NW sub-species, respectively). Country codes are: Tanzania (TZ), Democratic Republic of Congo (DRC), Madagascar (MAD), Cameroon (CAM), Central African Republic (RCA), Nigeria (NIG), Rwanda (RWA), Burundi (BUR).

Admixture analysis identified three genetic clusters (Fig. 3) that followed the partial mtCOI gene-based species diagnosis (see also Suppl. Fig. 3), whereby the SSA2 species was clearly shown to be different from the SSA1 species. The SSA1 species was further shown to consist of two distinct genetic backgrounds corresponding to SSA1-NW and SSA1-SE (containing ‘blue’ SSA-WA and SSA-ECA, and ‘red’ SSA-CA and SSA-ESA, respectively, as defined by Wosula et al.38). Some SSA2 and all the SSA1-CA samples showed evidence for low levels of admixture.

Figure 3
figure 3

Admixture analysis at K = 3 of 62 individuals of sub-Saharan African Bemisia cryptic species based on 14,358 SNPs. Individuals were clustered into the SSA2 species group, as well as two distinct clusters (red, blue) that belonged to the SSA1 species. The two SSA1 clusters in the Admixture panel included the SSA-WA and SSA-ECA groups (34 blue individuals; blue = SSA1-NW), and six SSA-CA (blue-red/blue-red-green) and 12 SSA-ESA (red; SSA1-SE) individuals. Evidence of introgression between the red and blue SSA1 clusters were predominantly detected in SSA-CA individuals. Three SSA2/SSA1 inter-specific hybrids (CAM_IMB2_SSA4, CAM_IMB1_SSA4, DRC_KICKAL2_CA) were also detected. Note: Figure colour scheme matches the phylogeny branch colours of Fig. 2A–C.

Interestingly, Fst analysis (Table 1) supported SSA1-NW and SSA1-SE as potential putative sub-species of SSA1, with intermediate Fst estimates observed for all pair-wise comparisons at the inter-specific (i.e., SSA2 vs. SSA1; Fst = 0.286–0.334) level. This was particularly evident for SSA1-NW versus the SSA1-SE ‘ESA’ population, with Fst of 0.226–0.243. Fst differentiation was less clearly defined for the SSA-CA population, whose Fst scores versus SSA1-NW were closer to those between populations within the two major SSA1 groups. The overall differentiation between SSA1-NW and SSA1-SE suggested likely existence of putative sub-species within SSA1, the observations concerning the SSA-CA population suggest gene flow between neighbouring populations. Treatment of SSA1 as single species against the SSA2/SSA4 species showed significant population substructure (Fst range 0.286–0.332). This limited gene flow between SSA2/SSA4 and the SSA1 (i.e., SSA-WA, SSA-ECA, SSA-CA. SSA-ESA) supported their separate species status. Treatment of the four SSA1 populations as belonging to two well-defined groups (i.e., SSA-NW, SSA-SE) akin to ‘putative sub-species’, showed that SSA-WA and SSA-ECA overall shared intermediate levels of gene flow with the SSA-ESA population. Fst estimates are lowest between the SSA-CA population and the SSA-WA, SSA-ECA, and SSA-ESA populations, suggesting greater gene flow at the contact zone.

Table 1 Population substructure based on the F-statistics (Fst) of Weir and Cockerham45.

Within the putative SSA1-NW sub-species, the SSA-WA population is predominantly found in the lowlands (ca. 300–1500 m), whereas the SSA-ECA/SSA-CA populations were found at higher altitudes (from ca. 600–≥ 1500 m). Individuals belonging to the SSA-ESA group were predominantly from coastal landscape of Tanzania and Madagascar (Fig. 4). A gene flow contact zone was detected at regions represented by five DR Congo individuals and one Tanzanian individual (i.e., the SSA-CA population). Inter-specific hybridization was detected in two SSA2 individuals (Fig. 3, green/blue genetic backgrounds) and in one DR Congo individual from SSA-CA (red/green/blue in Fig. 3).

Figure 4
figure 4

A map showing sampling sites of the SSA Bemisia cryptic species. The sample colour schemes and sampling sites followed Fig. 6B of Wosula et al.38 (green: SSA-WA; red: SSA-ECA; pink: SSA-CA; black: SSA-ECA; Yellow: SSA2; blue: SSA4). Note: The SSA4 species is likely NUMT artefacts of the SSA2 species (refer to main text). The African topological map colour represents elevation from sea-level (see insert map of non-incremental approximation of elevation). The gene flow contact zone between populations on the NW–SE transect is indicated by the thick blue dashed line. Map modified and prepared using Microsoft PowerPoint for Mac, Version 16.39 (20071300). Sourced: https://commons.wikimedia.org/wiki/File:Africa_topography_map.png.

The Treemix ML tree supported the SSA4 species as part of SSA2, and the migration edges (m = 2) revealed gene flow in both directions, between SSA1 and SSA2 species from central Africa (i.e., SSA-CA), and supports detection of hybrid individuals identified in the Admixture analysis (Fig. 3). The weight of the migration edge going from SSA-CA to SSA2 is also higher than the one going the opposite direction (Fig. 5). TreeMix ML tree also showed a lack of branch depth between each cluster of SSA2/SSA4, ECA/WA, and CA/ESA, thereby provided further support that entities within the three clusters were identical.

Figure 5
figure 5

Gene flow between SSA species. (A) TreeMix ML tree with two migration edges (m = 2), showing strong gene flow from SSA1 to SSA2/SSA4 (red arrow), and weaker gene flow from SSA2 to SSA1. Clustering between ECA and WA, and between ESA and CA, provided support for their close relationships that could potentially indicate their respective putative sub-species status (i.e., SSA1-NW, SSA1-SE, respectively). Two basal branches leading to SSA1 populations (i.e., ECA, WA, ESA, CA) and SSA2/SSA4 further supported the populations analysed as consisted of two species (i.e., SSA1, SSA2). The scale bar indicates ten times the average standard error (s.e.). (B) The residual fit was plotted from the ML tree, which is the covariance matrix between each pair of populations. Residuals above zero are an indication of the populations most likely to be admixed.

Discussion

We combined genome-wide SNPs and mtCOI markers to gain evolutionary, ecological and population genomic insights into the sub-Saharan African cassava Bemisia whitefly species complex. By taking into consideration the impact of NUMTs, we showed no conflict in the same nuclear and mtCOI dataset in defining Bemisia species status. A thorough filtering and reanalysis of genome-wide SNP markers also detected signatures of incipient speciation and admixture in the Bemisia SSA1 species. While various evolutionary genetic studies based on the partial mtCOI gene have proposed the presence of SSA1 species subgroups13,46, we caution against the use of the mitochondrial ‘subgroup’ classification based on genome-wide SNP markers. Mating experiments were inconclusive in separating subgroups of SSA1 especially between SG1 and SG231, while for the first time, genome-wide SNP loci showed that the SSA1 species could potentially be defined as to consist of a north-western (NW) and a south-eastern (SE) putative sub-species, with a contact zone separated by the eastern highland and central/western lowland regions.

Wosula et al.38 adopted the nextRAD genome-wide SNP approach first demonstrated by Elfekih et al.10,47,48,49 for Bemisia species to investigate gene flow patterns and diversity in the SSA1, SSA2, and SSA4 African cassava Bemisia whitefly species complex, but overlooked the impact of NUMTs10,32,35 in published B. tabaci phylogenetic studies based on partial mtCOI sequences. This led to the conclusion that the mtCOI gene lacked efficacies at differentiating African Bemisia cryptic species. Increasingly, reanalyses of the B. tabaci cryptic species complex partial mtCOI gene dataset have detected the presence of significant NUMTs and pseudogenes3,29,35. These studies highlighted on-going challenges associated with species delimitation in the cryptic B. tabaci species complex. Similar to MEAM228,50, our analysis supported that the SSA442 was likely NUMT artefacts as previously proposed29. The misidentification of SSA4 underpinned all subsequent genomic interpretations of gene flow and evolutionary ecology in the cassava SSA Bemisia whitefly species38.

Significant genetic diversity in the B. tabaci species complex can obscure species/putative sub-species identification. For instance, high genetic diversity in the MED species based on nextRAD SNP data has been reported10; however, the MED-ASL as a separate species was only further supported by including also complete mitogenome, mating experiments3, and host preference studies5. Genomics and mating experiments similar to the MED-ASL studies3,5 showed that the SSA1-SG1 and SG2 were likely the same species, with SG3 potentially being different31 (i.e., the ‘WA’ clade; Fig. 2B,C). Similar studies are needed to provide additional biological support for various SSA Bemisia species subgroups51 including the putative SSA1-NW and SSA1-SE sub-species.

In developing the DNA barcoding approach, Hebert et al.52 showed that the COI gene is overall effective for species level delimitation due to the associated nucleotide substitutional rates for this gene (but see Hanemaaijer et al.53), while for fine-scale identification such as within species level, the use of genes with greater mutational rates were recommended. Partial mtCOI gene therefore lacks the power to differentiate between evolutionary very closely related cryptic species status, for which a genomic approach should be considered3,31. Subgroup classification within the SSA1 African cassava Bemisia whitefly species based on the mtDNA gene might be possible, provided that a gene region with an appropriate substitutional rate is identified. Nevertheless, there is currently no evidence for maternal lineage-based patterns to population dispersal behaviour, host adaptation, and fitness cost differentiation in the B. tabaci complex and the African cassava Bemisia species complex, and therefore the subgroup delimitation based on the partial mtCOI gene is not justified.

The SSA1-SG1, representing the major SSA1 mtCOI subgroup, was previously implicated in the CMD pandemic in East Africa13. Wosula et al.38 reclassified SSA1-SG1 individuals (e.g., SSA1-SG3, SSA1-SG544) into three SNP-based groups of SSA-ECA, SSA-CA, and SSA-WA, due to discrepancies between the 657 bp mtCOI partial gene and nuclear SNP markers. Based on the same nuclear and mitochondrial dataset, we showed that while separate phylogenetic signals were detected for the ‘WA’ and ‘ECA’ populations, a strong genetic difference between the ‘CA’ and ‘ESA’ populations was lacking, and we therefore advocate following established good practice51 to refrain prematurely adopting other nomenclature (i.e., those proposed by Wosula et al.38).

Although refined higher-level species status could help resolve virus transmission capacity and provide insights into understanding pest introduction pathways11,54, there is currently insufficient evidence in African cassava Bemisia whiteflies to link either the WA/ECA or CA/ESA populations to historical CMD outbreaks. The B. tabaci SSA2 (i.e., ‘Ug2’33) was assumed to be associated with the 1990s CMD epidemic, mainly due to its geographic distribution in relation to the front of the epidemic55. However, transmission underpinning outbreaks of CMD in the sub-Saharan Africa region is complex, with one factor being the sharing and propagation of diseased cassava cuttings between farmers. There was a recombination event between two geminiviruses that produced a more virulent hybrid56, however we have little historical information about how the viruses interacted with the different Bemisia cryptic species in terms of disease dynamics57. Evidence for bidirectional interspecific hybridization detected between SSA1 and SSA2 in this study, therefore may have implications to understanding the evolutionary genetics associated with plant virus transmission capacity by the cassava SSA Bemisia whitefly species complex. The role of the SSA2 species in the CMD epidemic nevertheless remained unclear, with the SSA2’s habitat range (east and west Africa) overlapping the SSA1 species22,38,55,58,59,60,61,62. The exact relationship between SSA1/SSA2 and outbreaks of CMD63 should be re-visited, taking into consideration current genomic evidence for the presence of SSA1/SSA2 hybrids. Furthermore, vector competencies of other common cassava whitefly species (e.g., B. afer) to different cassava viruses also need further investigation, an ideally include a genomic component.

B. tabaci sensu lato on cassava in sub-Saharan Africa is a species complex going through a continuous diversification process. The putative SSA1-SE sub-species that the Madagascan (SSA-ESA) and Tanzanian (SSA-CA) populations belong to are geographically isolated from the Nigerian (SSA-WA), DR Congo, and Burundi (SSA-ECA) populations with a hybrid zone between Tanzania, Uganda and DR Congo. The East African Rift Valley is a major geographical barrier to gene flow and provides the environment to promote incipient speciation and to enable recent radiations to thrive and evolve64, and could underpin incipient allopatric speciation in the SSA1. The “rift effect” acting as a gene flow barrier has been described in other insects (e.g., the maize stalk borer Busseola fusca65, the malaria vector mosquitoes Anopheles gambiae66,67) and vertebrates (e.g., birds68, frogs69). Specifically, the western branch of the Rift Valley, i.e., the Albertine Rift covering part of Tanzania, Burundi, Rwanda, Uganda and DR Congo, is a mountain chain situated at 1500–3500 m above sea level (Fig. 4) formed by volcanic activity since the plio-pleistocene. In addition to volcanic activities, climate change during this glacial period led to habitat fragmentation that promoted species diversification in the newly formed biogeographical zones70. Species diversification in the African Bemisia whitefly cryptic species system has also been explained by host plant associations and their ability to feed on diverse host plants and acquiring new ones71. Using RNAseq, Malka et al.72 provided insights into how the variation in detoxification gene expression levels might have conferred plasticity that likely also contributed to species diversification.

Finally, we note that our genome-wide SNP phylogenetic analysis showed weak support (bootstrap value 34%, data not shown) for closer evolutionary relationship between SSA2 and SSA1-NW, than between the two putative SSA1 sub-species. This closer phylogenetic relationship between SSA2 and SSA1-NW is likely to be unreliable given the weak bootstrap support, greater gene flow patterns between the SSA1 sub-species (Table 1), and the congruence of species status based on mtCOI molecular diagnostics. our study therefore further highlights the need for careful selection of outgroups that takes into consideration the evolutionary biological questions to be addressed (e.g., whether the target species were more closely related (i.e., between SSA1, SSA2, SSA3, SSA6, SSA9)29 or evolutionary diverged (i.e., between SSA1 and AsiaII_1)29). Data obtained for this population genomic reanalysis involved only limited number of target and outgroups individuals and species that represented significant evolutionary divergent lineages from the SSA1/SSA2 species38. The challenge of obtaining sufficient sequencing coverage that is necessary for filtering to obtain a robust set of genome-wide SNP markers for the Bemisia cryptic species complex10,47,48,49 known to have a large and complex genome organization73,74 will likely further enhance and perpetuate their on-going status as one of the world’s most damaging and challenging group of agricultural insect pest complex.

Material and methods

mtCOI data handling

The mtCOI sequences of Wosula et al.38 were downloaded from GenBank (MF417578-MF417602), including all published reference sequences used to identify the SSA1 ‘genetic groups’. These partial mtCOI sequences were imported into Geneious R11.1 and aligned with reference sequences (Suppl. Fig. 1) using the MAFFT alignment program75 with the following settings (Algorithm option set at ‘Auto’; Scoring matrix at 200PAM/K = 2; Gap open penalty and Offset values at 1.53 and 0.123, respectively). We examined all sequences for: (i) insertion/deletions (indels), (ii) presence of premature stop codons, and (iii) presence of unexpected amino acid substitutions at evolutionary conserved regions in the partial COI gene region29 as signatures of pseudogenes (NUMTs).

nextRAD data processing

The raw data from nextRAD sequencing38 was accessed from GenBank (SRP103541). The fastq sequences were screened using FastQC v.0.11.4 for quality control. The reads were trimmed by quality and reads longer than 151 bp was trimmed to this length using Trimmomatic76 version 0.36.

Mapping quality

We mapped the reads from each individual sample to the B. tabaci MEAM1 genome73 using Burrows-Wheeler Aligner (BWA) v.0.7.1277. The alignments used the BWA-MEM algorithm with default settings. The SAM files were converted to BAM output using Samtools78 v. 1.3.1. The BAM files were subsequently sorted, indexed and checked for quality and mapping percentages per scaffold.

SNP genotyping

SNPs were first called using a de novo approach in PyRAD v. 0.979. PyRAD is a pipeline designed for RADseq datasets that aims to capture variation at the species/clade level. It allows clustering of highly divergent sequences and takes into consideration indel variation. We also performed SNP calling in PyRAD, using reads mapped to the MEAM1 reference genome73.

Genetic clusters and tests for introgression

Genetic clusters within the dataset that consists of 62 SSA samples were investigated by Principal Component Analysis (PCA) using the SNPRelate R package80. We used ADMIXTURE v.1.3.081 on the 62-sample dataset to identify genetic clusters and estimate the genetic ancestry of the SSA samples. We ran the program with K ancestral clusters varying from 1 to 10 with each ‘K’ value repeated 100 times. A cross-validation test was performed to determine the optimal K value.

Phylogenetic analyses

The topology of the phylogenetic relationships between individuals/populations/sub-species and species within the dataset (62 SSA samples and three outgroup individuals (TZ_CHA1, TZ_CHA2, TZ_KIL1)) was examined using a maximum likelihood (ML) approach. The SSA species status were first identified using the mtCOI marker, then, a phylogeny was reconstructed using the nextRAD genome-wide SNPs excluding samples with low genotype quality (mapping quality < 10) to minimize biases that could potentially be introduced by missing data. We performed five replicates of phylogenetic reconstruction in RAxML v.7.2.882 using the GTR substitution model and GTRGAMMA as the GAMMA model of rate heterogeneity, with 1000 bootstrap replications. TreeMix83 v. 1.12 was used to identify genetic mixing, the history of population splits and admixture patterns. We ran simulations in order to infer the best-fit scenario for the genetic relationships between these cassava Bemisia whitefly populations involving up to 5 migrations events with 100 bootstrap replications (e.g., based on previous studies using this type of analysis49,84,85. The likelihoods were compared using the following command in R: tail -n 1 Data/data.frq.strat.tree.*.*.llik | cut -d \: -f 2 > Data/data. The decision to select the number of migration m = 2 is based on the residual fit matrix generated by the TreeMix program. For each migration (m = 0, m = 1…, m = 5), we generate a residual plot. When this plot is saturated at a given m, the best fit scenario and the best-fit ML tree that corresponds to the specific m number are obtained. If it is m = 2, then any number bigger than 2 would not add any more information. Programming scripts used in this study are available as Supplemental Data 17.