60,000 years of interactions between Central and Eastern Africa documented by major African mitochondrial haplogroup L2

Mitochondrial DNA (mtDNA) haplogroup L2 originated in Western Africa but is nowadays spread across the entire continent. L2 movements were previously postulated to be related to the Bantu expansion, but L2 expansions eastwards probably occurred much earlier. By reconstructing the phylogeny of L2 (44 new complete sequences) we provide insights on the complex net of within-African migrations in the last 60 thousand years (ka). Results show that lineages in Southern Africa cluster with Western/Central African lineages at a recent time scale, whereas eastern lineages seem to be substantially more ancient. Three moments of expansion from a Central African source are associated with L2: (1) one migration at 70–50 ka into Eastern or Southern Africa; (2) postglacial movements (15–10 ka) into Eastern Africa; and (3) the southward Bantu Expansion in the last 5 ka. The complementary population and L0a phylogeography analyses indicate no strong evidence of mtDNA gene flow between eastern and southern populations during the later movement, suggesting low admixture between Eastern African populations and the Bantu migrants. This implies that, at least in the early stages, the Bantu expansion was mainly a demic diffusion with little incorporation of local populations.

Scientific Reports | 5:12526 | DOI: 10.1038/srep12526

analysis for haplogroup L0a (also present in Central and Eastern Africa by the time of the Bantu expansion 2 ) and a HVS-I population-based approach.

Results
Phylogeography of haplogroup L2. A schematic tree of haplogroup L2 is shown in Fig. 1. The complete phylogeny is shown in Supplementary Table 1, including ρ age estimates (considering both the complete mitochondrial genome and the synonymous clocks), ML age estimates and Bayesian age estimates based on a relaxed molecular clock for the main nodes. We tested the molecular clock with a likelihood ratio test 30 , which confirmed previous evidence of clock violation for this haplogroup 26,31,32 . In this sense, the Bayesian molecular clock, which allows rate variation, is more accurate. However, ML age estimates are also constrained by the tree structure and the overall mtDNA clock, which generally does not show strong violations 33 . ML age estimates are comparable to the Bayesian estimates, as shown by the correlation between the branch lengths estimated in both analyses (ratio close to 1) (Supplementary Fig. 1a). Nevertheless, when comparing the age estimates themselves, the Bayesian analysis provides substantially higher age estimates for younger nodes (Supplementary Fig. 1b). Considering that the clock used for Bayesian inference was based on the age of haplogroup L3 (~70 ka) 12 , which is probably inappropriate for the period of interest (mainly Holocene), and given the similar branch lengths estimated in the Bayesian inference and ML, we consider that the time-dependent clock employed in the ML analysis, which contains the correction for purifying selection developed by Soares et al. 23 , is more appropriate for such short time scales, particularly for ages below 20-15 ka (Supplementary Fig. 1b), as it previously showed a good correspondence for the recent colonization of the Pacific 34 . All age estimates for the L2 clades mentioned in this section are shown in Table 1.
In general, L2 has a very complex structure. L2 divides into five main branches (L2a-e) (Fig. 1). The earliest split of L2 (99.1 ka in ML; 78.6 ka in the Bayesian inference (BI)) is between L2e and L2a-d (L2a'b' c' d). Regarding L2e (38.9 ka in ML and 34.3 ka in BI (Table 1)), both the frequency distributions (Fig. 2e) and its most basal branches (Supplementary Table 1) suggest a Western African origin. A southern sub-branch, labelled in this study as L2e1a1a, dating to about 2 ka, was probably involved in the Bantu migration.
L2a splits from L2b-d (L2b' c' d) at ~93.5 ka in ML and ~73.7 ka in BI. L2a (~84.4 ka in ML; ~66.1 ka in BI), is geographically widespread and highly frequent throughout Africa (Fig. 2a) and accounts for more than 70% of all L2 branches 18,26 , with peaks of frequency in Ghana, Sudan and Mozambique. L2a divides into five branches (L2a1-5), with the earliest split between L2a5 and L2a1-4. L2a5 is practically restricted to Southern Africa but also detected at lower frequencies in Eastern Africa. It dates to ~56.2 ka in ML and ~46 ka in BI and suggests a migration into Eastern or Southeast Africa between ~95 ka and 45 ka (considering the range of age estimates for L2a and L2a5). Since it is found only in Bantu speakers, this clade was probably assimilated by Bantu migrants. L2a1-4 splits between L2a1 and L2a2-4. L2a2'3'4 (~36.5 ka in ML; ~32.3 ka in BI) seems to have a more Central African distribution, with a clear association to Pygmy groups. The major split with other regions is within L2a4 (27.6 ka in ML and 23.6 ka in   35,36 ), and the exclusively European L2a1k 37 . Phylogenetic reconstruction of L2a1 is often difficult due to high levels of homoplasy. Major splits within L2a1 defined by homoplasic positions (143, 16189, 16192 and 16309) exist for parsimonious reconstruction purposes but will not be considered in the text. L2a1a has clearly a Western/Central African origin and distribution, with many sub-clades suggesting a recent Bantu migration southwards, and is hardly present in Eastern Africa. This pattern is also visible in L2a1c, L2a1f and L2a1i. L2a1e and the minor clade L2a1m exist essentially only in Western/Central Africa. L2a1l displays a similar pattern in sub-Saharan Africa, but with the peculiarity of a sub-branch present in Ashkenazi Jews, L2a1l2a 35 . L2a1b again shows an origin in Central Africa, but subclade L2a1b1a dating to 6.9 ka in ML is present in Southern Africa and has a few lineages in Eastern Africa (mainly Somalia). 
It might have moved earlier to the East in the Early Holocene and been incorporated later by Bantu migrants. L2a1d splits into an Eastern African sub-clade (L2a1d1) at ~10.6 ka and L2a1d2, which shows a split between a Western African lineage and a Southern African clade dating to about 7 ka that contains the star-like L2a1d2a clade dating to 3.7 ka. Other clades show additional evidence of an early migration into Eastern Africa, like L2a1h and L2a1j. We detected a new clade specific to Somalia, L2a1r, at 7.3 ka. The clade L2a1 + 143 shows several basal Eastern African lineages (together with Near Eastern and Arabian lineages) that indicate a migration in the Early Holocene. Minor clades, namely L2a1g and L2a1q, are present in Bantu-speaking populations in the South and, although they were not detected in Western/Central Africa, their lower age suggests a direct involvement in the Bantu expansion.
The major split within L2b' c' d is between L2b' c and L2d (~86.7 ka in ML; 65.1 ka in BI). L2d (~19.0 ka in ML; ~16.2 ka in BI) is a rare clade dominated by basal western branches, supporting the overall origin of L2 in Western Africa. Lastly, L2b and L2c split at 65.3 ka in ML and 49.7 ka in BI. L2b (~26.0 ka in ML and ~24.5 ka in BI) displays four branches, all with a probable origin in Western/Central Africa. Two sub-branches, L2b1a3 and L2b2a, could have been involved in the Bantu expansion. L2c (~17.2 ka in ML; ~18.3 ka in BI) is essentially western as reported elsewhere 38,39 , reaching the highest frequencies in Gambia, Sierra Leone and Cape Verde (Fig. 2c). Two branches, L2c2a1 (5.9 ka in ML) and L2c2b1b (~2.5 ka in ML), are associated with southern populations and most likely related to the Bantu expansion. In order to assess whether the movements observed in the L2 tree were accompanied by an increase in the effective population size (Ne) associated with the haplogroup, we computed Bayesian Skyline Plots (BSPs). The BSP for the total sub-Saharan African L2 dataset shows two moments of increment in the Ne associated with L2 (Supplementary Fig. 2a): ~11.5 ka, corresponding to the Pleistocene/Holocene transition, and ~5 ka, probably associated with the Bantu expansion.

Early Holocene gene flow between Central and Eastern Africa. In a previous study based on HVS-I, a probable major migration of L2 lineages in the Pleistocene/Holocene transition was discerned 13 . We reproduced the network analysis for the present HVS-I dataset and the three main nodes detectable at HVS-I level (L2a root, L2a + 16189, L2a + 16189 + 16192) showed founder ages 40 between 7 ka and 15 ka, supporting a movement into Eastern Africa substantially before the Bantu expansion. L0a also showed a similar pattern but in the opposite direction 2 , suggesting that bidirectional gene flow occurred between Central and Eastern Africa in the Pleistocene/Holocene transition period. Since phylogeographic inferences based on HVS-I alone can be misleading, we further tested the patterns with whole-mtDNA genomes.
Of the 801 complete L2 sequences included in our analysis, only 39 are from Eastern Africa. Considering that 24 were sequenced in this study, it is evident how poorly this region was represented in past studies. L2 complete sequences, similarly to HVS-I, indicate that most lineages arrived in Eastern Africa in the Early Holocene or Late Pleistocene. L2a1b contains Somali lineages, whose founder age in Eastern Africa is 7.9 ka [1.5; 14.5]. A BSP restricted to Eastern Africa shows an Ne increase in the Pleistocene/Holocene transition (Supplementary Fig. 2c), despite the low number of eastern samples. A signal of expansion was also visible at ~10 ka in the BSP containing Western/Central African lineages (Supplementary Fig. 2b), which could be expected considering that the migration of L2 lineages into Eastern Africa might have been triggered by an expansion of these lineages in the source.

Bantu expansion.
Most of the typically Eastern African sequences (many L3, L4, L5 and L6 clades) are not frequent in Southern Africa, whereas L0a and L2 are also very common in Southern Bantu-speakers (Fig. 3). Considering their wide distribution across Central and Eastern Africa, probably established in the postglacial period, L2 and L0a southern clades could have one of two origins: they could trace their immediate origin to Central Africa in the last 5 ka, indicating that they were carried by Bantu agriculturalists; or they could coalesce with Eastern African branches, suggesting direct gene flow from non-Bantu populations to the migrating Bantu-speakers in Eastern Africa (possibly in the Great Lakes area).
Taking advantage of the recently well characterized Southeast African whole mtDNA pool 41-43 , we found that most (88%) L2 Southeast African branches perfectly mirror the Bantu expansion, deriving from Central African clades in the last few millennia and mostly displaying a star-like pattern, a signal of a recent expansion. Examples of such clades are found in L2e1a1a (1.8 ka), L2c2b1b (2.5 ka) and L2a1a2a1a (5.3 ka). These clades do not display any Eastern African representatives (Supplementary Table 1) or, in the few exceptions, the eastern samples result from gene flow from settled eastern Bantu populations, rather than the other way around.
There are, nevertheless, episodic minor southern clades presenting an Eastern origin (e.g. the minor L2a1h). In L2a1d2 there is a Southern African branch deriving from Western Africa and containing the major sub-branch L2a1d2a (labelled in this study, ~3.7 ka), with a very strong star-like pattern. However, one Zambian sample places L2a1d2a ancestry in Southern Africa at 7 ka. Considering the overall pattern, either a closer Central African representative is missing by chance, or the Zambian ancestral sample contains sequencing errors, a possibility that cannot be excluded.
But the most singular case representing the bulk of southern L2 lineages not displaying a Central African Bantu origin is L2a5 (56.2 ka in ML; 46 ka in BI). L2a5 is present essentially in Southeast Africa, despite having been also detected in Eastern Africa in the HVS-I dataset. Its split with other L2a lineages dictates the root of L2a and suggests a movement from Central Africa to Eastern or Southeast Africa between the age of L2a (84.4 ka in ML, 66.1 ka in BI) and the age of L2a5. Movements from Eastern to Central Africa, probably associated to climate change 44 , might have carried ancestral lineages of L3e and L3b' d around 50-40 ka 12 and it is not unlikely that movements occurred in both directions. This clade being incorporated by Bantu-speakers represents the major detected input of autochthonous lineages (either in Eastern or in Southeast Africa), apart from L0d and L0k further south. One possibility is that this lineage was already present in the early Bantu populations moving South after the standing point in the Great Lakes in Uganda. Since we have no whole mtDNA information on Uganda we cannot exclude it as the possible origin of L2a5.
The BSP for Southern Africa (Supplementary Fig. 2d) shows a rapid increment in L2 Ne at ~2.5 ka, consistent with the results of the overall L2 data and with a rapid increase during the Bantu expansion. Again, the signal was also detectable in the BSP for Western/Central Africa (Supplementary Fig. 2b), which presents a second peak at ~5 ka (Supplementary Fig. 3). The Southern African dataset is the only one with a small increment between 60 and 70 ka (Supplementary Fig. 3), which could be due to the presence of the aforementioned L2a5 in the region.
Like L2, the distribution of L0a was for a long time linked to Bantu movements, but recent evidence supports an earlier expansion of L0a to Central Africa during the Pleistocene/Holocene transition 2,28 and a later incorporation in the Bantu expansion southwards. An updated L0a phylogeny is shown in Supplementary Table 2. Time to the most recent common ancestor (TMRCA) estimates for the branches mentioned in the text are also shown in Table 1. L0a shows great parallelism with L2, the main difference being the principal direction of movement in the postglacial period (L0a originated in Eastern Africa). Nearly 75% of the Southern African L0a lineages show a Central African origin (Supplementary Table 2), despite the eastern origin of this haplogroup. Clades like L0a1a2 and L0a1b1a have a Central African origin and an association with the Bantu expansion, without the involvement of Eastern African lineages. However, one frequent branch, L0a2a2a, shows evidence of an Eastern African origin and probably represents the assimilation of Eastern lineages by the migrating Bantu groups.

Population analysis.
In order to confirm the dynamics of Bantu and Eastern populations observed within haplogroup L2, we performed a HVS-I population-based analysis. The MDS plots ( Fig. 3 and S4) display Young's S-stress values that guarantee that they are accurately portraying relationships between the populations 45 . The first dimension (Fig. 3) does not consistently differentiate any comprehensive group based on geography or language. Nevertheless, it separates Western from Central Africa, with Southern Africans grouping mainly with Central Africans. The most divergent groups, which keep their differentiated positions in the MDS excluding L2a and L0a ( Supplementary Fig. 4), are the Kuvale, Bantu-speakers with a seminomadic pastoralist lifestyle 46 , and the Fwe, who have incorporated click consonants, typical of Khoisan idioms, into their language 47 . Previous studies showed admixture of these two groups with Khoisan neighbours 41,46 .
The second dimension (Fig. 3) separates Eastern African groups from the other populations. The most divergent group on this dimension is El Molo, a Kenyan Cushitic group known for its genetic isolation 48 . If a line were drawn between 0.2 and 0.3 on this dimension, it would separate Eastern populations from Bantu groups, suggesting low intrusion of eastern African lineages into the Bantu-speakers expanding to the South, who would otherwise show a greater proximity to Eastern Africans. On the border of this hypothetical line lie Sudan and Luhya (LWK), from Eastern Africa, and three southern Bantu populations, Kunda (Zambia), Shona (Zimbabwe) and Nyaneka (Angola). A detailed assessment of haplogroup composition was performed (Fig. 4 and S5). Nyaneka displays a haplogroup composition typical of Bantu populations and an absence of Eastern lineages, but a fairly high frequency of L0a (~20%) (Fig. 4 and S5). In the second MDS (Supplementary Fig. 4), excluding L2a and L0a haplotypes, this proximity to Eastern Africa was no longer visible. Shona presents non-L lineages (~5%) and some typically eastern lineages, like L4, at very low frequencies (~3%) (Supplementary Fig. 5). Kunda harbours lineages labelled as L0* (~5%) at a frequency comparable to Eastern Africa (Fig. 3 and Supplementary Fig. 5). Overall, these three populations have an almost entirely Central African Bantu ancestry and there is little evidence of contact with Eastern populations.
On the other hand, the Luhya and Sudan are the eastern groups with the closest positions to southern Bantus. The Luhya, a Kenyan Bantu-speaking group, seems to display a dual Bantu/East African maternal ancestry. It harbours a high proportion of L3b (typical of Western Africa) and, on the other hand, L0a, L4 and L5 at frequencies comparable to Eastern Africa ( Supplementary Fig. 5). This suggests Bantu genetic input into Eastern Africa (as expected since the Luhya are Bantu speakers) but not the opposite. The position of Sudan is essentially due to its high frequency of L2a ( Fig. 2a and 4), since its proximity to Western and Bantu populations disappears when excluding L2 sequences from the analysis (Supplementary Fig. 4). L2a reaches a peak in Sudan (~30%), much higher than the average for Eastern Africa (~12%) and more similar to Western Africa ( Supplementary Fig. 5). This proximity was probably established in the postglacial period, based on our phylogeographic analysis performed here.

Discussion
We extensively analysed the phylogeography of the major African mtDNA haplogroup L2. Following an origin in Western or Central Africa, L2 was involved in at least three movements out of Central Africa: (1) in the Pleistocene, at least 50 ka ago, documented by the emergence of L2a5 either in Eastern or Southern Africa, whose arrival could match a period of climate change that could also have triggered the out-of-Africa (OOA) migration and Eastern to Central African movements 12,44 ; (2) in the postglacial period comprising the Late Pleistocene and Early Holocene (a period that witnessed major changes in climate and vegetation 6,10 ), mainly involving the expansion of L2a throughout Eastern Africa, in the opposite direction to haplogroup L0a 2 ; and (3) in the Late Holocene, when L2 clades, mostly L2a clades, were deeply involved in the Bantu expansion to Southern Africa.
Although archaeological evidence suggests that the Bantu dispersals towards Southern Africa paused for a few hundred years at the Great Lakes region, and some gene flow from Eastern populations into the arriving Bantu groups can therefore be expected before they expanded South, specific eastern mtDNA lineages (like L4, L5, L6 and several L3 sub-clades) are not detected at considerable frequencies in Southern Bantu-speaking populations. MtDNA haplogroups L0a and L2 were present in Central and Eastern Africa at least since the Early Holocene and they could have moved to the South from a dual source. However, at whole mtDNA genome resolution, we could see that at least three quarters of L0a and L2 in the south originated in Central Africa in the last 5 ka. This suggests a low maternal Eastern African ancestry in the Southern Bantu populations, confirmed by genetic distances at the HVS-I population level.
Y-chromosome evidence also indicates little gene flow in the paternal component between eastern populations and the Bantu migrants 49 . Uganda currently represents the major gap in mtDNA sampling and, since it lies in the transition between Nilotic and Niger-Kordofanian languages, it could provide valuable insights on the dynamics of Bantu and non-Bantu populations in the Great Lakes region. Y-chromosome information is available for Uganda, but only for Nilotes 49,50 , who present a low proportion of shared haplotypes with Bantu neighbours, suggesting at least low gene flow from Bantu to Nilotic people. Genome-wide data suggest a shared component across most Bantu populations 51 . A recent high-resolution characterization of African populations supports this same picture 52 . At a likely estimation with six components (K = 6), the dominant Western/Central component prevails in Bantu-speaking populations in Southern Africa (admixed with the component found in Khoisan populations) and in Eastern Africa, where it is admixed with an eastern component that was basically not detected in the south 52 . However, autosomal information from a typical Southeast African Bantu population is still unavailable and so it is difficult to assess an Eastern African input in the eastern Bantu route into Southern Africa at the genome-wide level.
The Bantu expansion is often placed within a group of linked theories proposing demographic expansions, associated with a specific language and triggered by the advent of agricultural practices called the Language-Farming dispersal hypothesis 16 , which includes the spread of Indo-European in Europe and Austronesian in Island Southeast Asia (ISEA) and the Pacific. While genetic patterns of current European and ISEA populations are difficult to reconcile with a demographic diffusion model associated with language and agricultural dispersal 9,53 , African Bantu-speaking populations display a genetic signature that allows linking them to a Central African origin. While genetic incorporation of autochthonous diversity into Bantu-speaking populations certainly occurred during the last two millennia throughout Africa 41,43,46 the phylogeography of mtDNA haplogroups L2 and L0a as well as the population analyses performed here suggest that that incorporation was minimal during early Bantu expansion through the Eastern Route that continuously mimics the expansion of its Central African ancestral population, even when considering that Bantu-speakers were stationed in Eastern Africa for a few hundred years before their migration southwards.
Understanding worldwide population history is essential for studies on disease susceptibility, adaptation and pharmacogenetics, and Africa, as the cradle of modern humans and the most genetically diverse continent, plays a central role in this genetic characterization 52 . Central Africa, which we previously hypothesized to be a likely point of origin of modern humans 2 , was also the source of migrations in the Late Pleistocene/Early Holocene and, more drastically, of a demic diffusion in the last few millennia that carried throughout the continent both signals of genetic adaptation developed in Central Africa and, eventually, recently developed pathogenic mutations 54 .

Methods
Sampling, mtDNA sequencing and haplogroup affiliation. We targeted 44 samples representative of the diversity of haplogroup L2 in different African regions for complete mtDNA sequencing: four individuals from Ethiopia, 12 from Somalia and nine from Sudan (a total of 25 samples from Eastern Africa), 14 from Mozambique (Southeast Africa) and five from São Tomé and Príncipe (Western Africa). All the samples belonged to unrelated individuals who gave informed consent. Sudanese and Ethiopian samples were collected from emigrants in Dubai and Somali samples were from refugees in Yemen. Samples from Mozambique and São Tomé and Príncipe were collected locally, in Maputo and Príncipe Island, respectively. The work was approved by the Ethics Committee of the University of Porto (11/CEUP/2011).
We performed amplification and Sanger sequencing reactions as reported previously 55 . We compared sequences to rCRS 56 using Geneious v.5.4 57 and we manually checked and annotated polymorphisms according to the nomenclature in PhyloTree (Build 16, February 2014) 58 . We combined these sequences with published complete L2 mtDNA sequences for a total of 801 L2 complete sequences (Supplementary Table 3). For comparative purposes, we also performed a reanalysis of L0a phylogeny 2 (based on published whole-mtDNA sequences), with a total of 303 mtDNA genomes (Supplementary Table 4). We used MtDNA GenSyn software 59 to convert sequences into haplotypes. Sequences are available at GenBank with accession numbers KR135841-KR135884.
Phylogenetic reconstruction. Phylogeny was reconstructed based on a preliminary reduced-median network analysis with Network v.4.611 60 , which suggested a branching order that was manually constructed considering the frequency of each mutation as reported 23 and the additional guidance of PhyloTree (Build 16).
In order to estimate the TMRCA of L2 and L0a internal clades, we used both rho (ρ) statistics and maximum likelihood (ML) analyses. In addition to the positions removed for phylogenetic reconstruction, we also excluded all indels for the following analysis, since this type of variation is not considered by the models used for age estimation. The ρ statistic 61 estimates the average number of mutational steps from a given ancestral node to the tips of the phylogeny and is converted to time purely on the basis of a given mutation rate, without any evolutionary model. The mutation rate applied to the complete genome was one substitution every 3,624 years, corrected for purifying selection, and the synonymous mutation rate was one substitution every 7,884 years 23 . Standard errors were estimated realistically 62 . We performed ML estimates of branch lengths using PAML 4 63 . We assumed the HKY85 mutation model as done previously 23 with gamma-distributed rates (discrete distribution of 32 categories). We considered two partitions so as to differentiate the fast evolving HVS-I and HVS-II from the rest of the molecule.
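The ρ calculation described above can be illustrated with a minimal sketch. It assumes a toy tree given as a list of branches, each with its number of mutations and the number of sampled tips descending from it; the function name `rho_and_se` and the toy data are hypothetical. The standard error uses the commonly cited branch-weighted formula, which is an assumption about what reference 62 describes. The conversion to years is a plain linear scaling by the rates quoted in the text and does not implement the non-linear correction for purifying selection of reference 23.

```python
from math import sqrt

# Rates quoted in the text (years per substitution).
YEARS_PER_SUBSTITUTION = 3624      # complete genome
YEARS_PER_SYNONYMOUS = 7884        # synonymous positions only


def rho_and_se(branches, n_tips):
    """Return (rho, standard error) for a clade.

    branches: list of (n_mutations, n_descendant_tips), one entry per
              branch below the ancestral node (hypothetical input format).
    n_tips:   number of sampled sequences in the clade.
    """
    # rho is the mean number of mutations between the node and each tip.
    rho = sum(muts * tips for muts, tips in branches) / n_tips
    # Assumed standard error: each branch is weighted by the square of
    # the number of tips that share it.
    se = sqrt(sum(muts * tips * tips for muts, tips in branches)) / n_tips
    return rho, se


# Toy clade with 4 tips: one internal branch (1 mutation, 2 tips below it)
# and four terminal branches carrying 1, 0, 2 and 1 mutations.
toy_branches = [(1, 2), (1, 1), (0, 1), (2, 1), (1, 1)]
rho, se = rho_and_se(toy_branches, n_tips=4)

# Linear (uncorrected) conversion to years.
age_years = rho * YEARS_PER_SUBSTITUTION
se_years = se * YEARS_PER_SUBSTITUTION
print(f"rho = {rho:.2f}, age = {age_years:.0f} +/- {se_years:.0f} years")
```

For the synonymous clock the same function would be applied to branch counts restricted to synonymous mutations, with `YEARS_PER_SYNONYMOUS` as the scaling factor.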
Evidence of violation of the molecular clock was previously reported for African haplogroups, mostly L2 26,31,32 . We conducted the PAML analysis for this haplogroup both with and without a molecular clock and performed a likelihood ratio test, which indicated deviations from the molecular clock. In addition, we estimated ages of L2 internal nodes using BEAST v.1.8.0 64 (100,000,000 iterations with a burn-in of 10,000,000 steps), applying both a strict and a relaxed molecular clock (which allows rate variation across lineages) and a mutation rate of 2.6186 × 10^−8 substitutions per site per year (calculated previously for haplogroup L3 12 ). We compared both analyses by calculating a Bayes factor, which showed very strong differences 65 , supporting the use of a relaxed clock for L2. However, when we compared the branch lengths calculated in the Bayesian inference and ML, they were extremely similar.
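As an illustration of the likelihood ratio test for the molecular clock, the sketch below compares the log-likelihoods of a clock and a no-clock model. It assumes the usual setup in which the test statistic is twice the log-likelihood difference and the degrees of freedom equal the number of sequences minus two; the log-likelihood values are invented for the example and are not results from this study. The p-value uses the Wilson-Hilferty normal approximation to the chi-square distribution so that only the standard library is needed; this approximation is a convenience chosen here, not the procedure used by PAML.

```python
from math import erfc, sqrt


def clock_lrt(lnl_clock, lnl_free, n_tips):
    """Return (statistic, degrees of freedom) for a clock vs no-clock test."""
    statistic = 2.0 * (lnl_free - lnl_clock)
    # A rooted clock tree has n-1 node ages, an unrooted tree has 2n-3
    # branch lengths, so the models differ by n-2 parameters.
    df = n_tips - 2
    return statistic, df


def chi2_sf_approx(x, k):
    """Approximate upper-tail chi-square probability (Wilson-Hilferty)."""
    if x <= 0:
        return 1.0
    z = ((x / k) ** (1.0 / 3.0) - (1.0 - 2.0 / (9.0 * k))) / sqrt(2.0 / (9.0 * k))
    return 0.5 * erfc(z / sqrt(2.0))


# Hypothetical log-likelihoods for a 20-sequence toy dataset.
statistic, df = clock_lrt(lnl_clock=-1050.0, lnl_free=-1000.0, n_tips=20)
p_value = chi2_sf_approx(statistic, df)
print(f"LRT = {statistic:.1f}, df = {df}, clock rejected: {p_value < 0.05}")
```

A small p-value indicates that the clock-constrained tree fits significantly worse, which is the situation the text reports for L2.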
We assessed variations in the effective population size (Ne) associated to haplogroup L2 over time with Bayesian Skyline Plots (BSPs) 66 , also obtained with BEAST v.1.8.0 and visualized with Tracer v.1.6 (http://beast.bio.ed.ac.uk/Tracer). Haplogroup L2 does not equate to population data, but a BSP applied to a specific lineage is expected to provide insights into the increments of the population associated with that lineage. This kind of approach has been performed before with complete mtDNA sequences for various haplogroups with satisfactory results 2,12,28,35 . We assumed a 25-year generation time 67 . We organized the samples in major monophyletic groups to resemble main subhaplogroups (L2a, L2a1, L2b, L2c, L2d and L2e), in order to guarantee a tree structure similar to our phylogenetic reconstruction and allow direct comparison among different analyses. Apart from the BSP for the entire sub-Saharan African dataset of complete L2 sequences, additional regional BSPs (Western/Central, Eastern and Southern Africa) were computed.
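The role of the generation time can be shown with a short sketch. It assumes that the skyline values exported from the analysis are expressed as Ne multiplied by the generation time in years, which is the usual convention when the clock is calibrated in years; the function name `skyline_to_ne` and the input values are hypothetical.

```python
GENERATION_TIME_YEARS = 25  # generation time stated in the text


def skyline_to_ne(ne_tau_values, generation_time=GENERATION_TIME_YEARS):
    """Convert skyline values (Ne x generation time) to Ne."""
    return [value / generation_time for value in ne_tau_values]


# Invented skyline medians, oldest to youngest time point.
example_ne_tau = [25000.0, 50000.0, 200000.0]
print(skyline_to_ne(example_ne_tau))
```

Because the conversion is a constant rescaling, the timing and relative size of the increments in the plots do not depend on the generation time chosen.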
We additionally performed reduced-median network analysis 60 of haplogroups L0a and L2 (based on HVS-I). We calculated founder ages of L2a main nodes in Eastern Africa considering the mutation rate previously calculated for HVS-I 33 scaled to the size of the fragment considered (position 16090 to 16365).

Frequency distribution maps and population comparisons.
In order to visualize their geographic distribution within Africa, we constructed frequency distribution maps (based on HVS-I data) for major L2 subhaplogroups (L2a, L2b, L2d, L2e and L2*) with Surfer® v.8 (Golden Software) using the Kriging algorithm. L2c is not distinguishable by HVS-I polymorphisms alone 18 , therefore we calculated its frequency as L2*. The dataset included 13910 HVS-I samples from 39 different countries (Supplementary Table 5). This dataset contains published HVS-I African sequences, plus the HVS-I segment of complete mtDNA sequences from 1000 Genomes 68 and recent population studies 41-43,69-71 .
From the dataset used to compute the frequency distribution maps, we considered only populations with more than 30 individuals and with ethnic and/or linguistic information available for population-based analysis. However, the general populations that were sampled for the L2 phylogeography analysis (Ethiopia, Mozambique, São Tomé and Príncipe, Somalia and Sudan) were also included. Since the main goal was to infer the relationships between sub-Saharan groups (particularly between Bantu and Eastern groups), populations from North Africa, Pygmy and Khoisan groups were not included.
We merged the same ethnic groups from different countries and/or different studies, as well as different ethnic groups from the same country, and considered them as a single population whenever FST was not significant in a preceding analysis. We always applied Bonferroni correction to p-values. The final MDS plot includes 55 populations (Supplementary Table 6), corresponding to 4880 individuals. In addition, we assessed genetic distances excluding L2a and L0a sequences and we constructed a second MDS plot, based on a total of 3323 individuals. We assessed mtDNA haplogroup composition in Western, Central, Eastern and Southern Africa (both by region and by country) (Supplementary Table 7).
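The merging rule with Bonferroni correction can be sketched as follows. The example assumes a set of pairwise FST p-values keyed by population pair and a nominal significance level of 0.05; the population labels, the p-values and the function name `non_significant_pairs` are all hypothetical, and the sketch only flags candidate pairs rather than reproducing the full merging procedure.

```python
def non_significant_pairs(p_values, alpha=0.05):
    """Return the pairs whose FST p-value is not significant after
    Bonferroni correction (candidates for merging)."""
    n_tests = len(p_values)
    # Bonferroni: divide the nominal level by the number of comparisons.
    threshold = alpha / n_tests
    return sorted(pair for pair, p in p_values.items() if p >= threshold)


# Invented p-values for three pairwise comparisons.
example_p_values = {
    ("PopA", "PopB"): 0.001,   # significant after correction
    ("PopA", "PopC"): 0.030,   # significant only before correction
    ("PopB", "PopC"): 0.400,   # not significant
}
print(non_significant_pairs(example_p_values))
```

With three comparisons the corrected threshold is 0.05 / 3, so the second pair, nominally significant at 0.05, is treated as not differentiated.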