Introduction

The early development of agriculture triggered significant population growth, resulting in the expansion of early farming populations, along with the spread of language families in many parts of the world, including Africa.1 The many advantages of agricultural subsistence over foraging are a likely contributing factor to the rapid expansion of agriculturists and their languages during the Holocene.2 A well-known example of this phenomenon in Africa is the expansion of the Bantu-speaking people (EBSP), which is thought, on the basis of linguistic evidence, to have started around 5000 years ago3 in the region on the border between modern-day eastern Nigeria and Cameroon.4 It is widely accepted that there was an early split into eastern and western routes in which farmers first expanded east and also, within 1500 years, reached West-Central Africa. After that the expansion is thought to have taken two directions with one wave moving along the south-western coast (West-Bantu route) and the other moving further east, forming the eastern Bantu core by 3000 years before present (YBP). Subsequently, the expansion is thought to have continued along the south-eastern coast (East-Bantu route).5 In an alternative model, the split came later after passage through the rain forest.3, 4 The Bantu language family is distributed throughout most of sub-equatorial Africa and is the continent’s largest, both in terms of the numbers of individuals speaking it and its geographic spread.2, 6, 7 This level of linguistic homogeneity among geographically distant populations across sub-Saharan Africa supports the suggestion of rapid expansion.
Because of the separation of the two major waves around 3500 YBP and the subsequent isolation of many groups, modern Bantu languages can be divided into West and East Bantu.8, 9 Non-Bantu languages (that do not belong to the Niger-Congo phylum, eg, Khoisanid languages in the southwest of the continent and a few Cushitic and Nilo-Saharan tongues in the northeast6) have survived in areas that have not experienced extensive inward migration of Bantu speakers. It is also suggested that although the Bantu-speaking agriculturists may have replaced, to a substantial extent, hunter gatherers in their path, they have also, in some places, co-existed and interbred with the original inhabitants.2

Archaeological evidence suggests that the early expansion of proto-Bantu speakers was associated with pre-Iron Age farming technology and did not involve smelting metals.3 The first evidence of metallurgy south of the Sahara was found at Nok in Nigeria and is dated to no earlier than 2500 YBP.10 Therefore, it is possible that with the aid of the new technology, further expansions may have occurred after the first dispersal of farmers. Because the Bantu languages on the eastern route are more homogeneous than those on the western route,11 it is reasonable to speculate that later expansions occurred mainly on the eastern route.

Early genetic studies of Bantu-speaking people were based on classical gene frequency data. Attempts were made to identify genetic relationships among EBSP groups in the context of Africa as a whole10, 11 (also see Supplementary Figure S112). The major finding of these studies was that genetic distances (FST) among all EBSP groups are much less than the average FST among West-African and Nilo-Saharan groups, indicating a considerable level of homogeneity among EBSP groups. More recently, based on over 1300 autosomal markers, Tishkoff et al13 showed that Bantu-speaking groups exhibit a considerable level of genetic similarity, a finding which is in good agreement with earlier studies mentioned above. The EBSP impact on African demography has, over the past decade, also been studied by analysing paternal and maternal sex-specific genetic systems (non-recombining region of the Y chromosome (NRY) and mitochondrial DNA (mtDNA)). As both NRY and mtDNA genetic systems have smaller effective population sizes than autosomal markers, they are more prone to genetic drift14, 15, 16 and are therefore more likely to differ among groups than are autosomal markers. However, because each is, in effect, a single linked locus, interpreting observed differences among groups must be undertaken with a high level of caution. 
The control region of the mtDNA sequence, due to its high mutation rate, has been extensively used in examining the impact of EBSP on the genetic landscape of sub-Saharan Africa.5, 17, 18, 19 It has been postulated that some mtDNA haplogroups (eg, L3b, L3e and L2a), based on their distribution in sub-Saharan Africa, are associated with the EBSP, whereas the presence of haplogroup L1c at high frequency in some populations on the western route is thought to be the result of assimilation of local female hunter gatherers.17 It has been suggested that because agriculturist men are more likely to marry local women rather than vice versa,15, 16 the maternal genetic profile of Bantu-speaking groups is marked by considerable diversity. Despite this level of diversity, however, there is a high level of similarity between groups.20

The increase in the rate of identification of slowly mutating NRY binary markers (ie, unique event polymorphisms (UEPs))21, 22, 23 has resulted in many studies designed to investigate the paternally mediated genetic relationships of sub-Saharan African populations. Scozzari et al24 and Underhill et al25 found UEP (M2 and its analogues such as DYS271G) present at high frequencies specifically in sub-Saharan Africa and suggested this marker as a signature of EBSP. Since then, this marker (now defining the E1b1a haplogroup) has been typed in many groups across sub-Saharan Africa19, 26, 27, 28 and, without exception, all studies have shown that the majority of NRY types in Bantu-speaking groups belong to this haplogroup. Although sampling in most NRY studies of sub-Saharan Africa has, in the past, been quite limited in terms of geographic coverage and sample sizes, the distribution of this haplogroup is relatively well described in groups living along both the postulated western and eastern routes of the EBSP, as well as in Senegal29 and Cameroon27, 30 in West Africa. Interestingly, de Filippo et al31 recently reported differences in the frequencies of haplogroups E1b1a and E1b1a7 between Bantu and Non-Bantu Niger-Congo speakers. As the EBSP shows a clearer genetic legacy in the paternally inherited genetic system compared with mtDNA (evident from high and similar frequencies of E1b1a) in sub-Saharan Africa,32 it is possible that, as suggested by de Filippo et al,31 fine-scale E1b1a typing of Bantu-speaking communities throughout sub-Saharan Africa may add more structure to the geographic distribution of haplogroups. This, we hypothesise, may shed light on routes taken during their expansion.

Pakendorf et al7 in a recent review of the contribution made by molecular genetic analysis to the study of EBSP concluded that patrilocality and possibly polygyny may have contributed to NRY, but not mtDNA, association with linguistic affinity. They note that in studies to date, Eastern African groups are greatly underrepresented but essential for investigating the direction of expansion. They further observe that the lack of genetic data makes it premature to reach sweeping conclusions concerning the EBSP.

In this study, we analyse, as did Alves et al,33 both UEP and short tandem repeat (STR) markers (in this study restricted to the NRY) to show that the geographic frequency distributions and the times to the most recent common ancestor (TMRCAs) of the haplogroups comprising haplogroup E1b1a in 43 sub-Saharan African groups (n=2757) with diverse linguistic affiliations (Supplementary Figure S1) reveal multiple waves of expansion from West Africa, with a late expansion along the eastern route but not the western. As a consequence, this study makes an important contribution to filling the gap that Pakendorf et al7 identify and provides evidence of greater complexity in the process of the EBSP, as suggested by Alves et al33 and Montano et al.34

Materials and methods

Samples

Buccal swabs were collected from males >18 years old unrelated at the paternal grandfather level but otherwise randomly selected from 43 groups across sub-Saharan Africa (Supplementary Table S1, samples from Ghana, Nigeria and Cameroon were included in Veeramah et al (2010)35 and from South Africa in Thomas et al (2000)36). These locations mainly cover West, Central-West, East, South-East and South Africa. All of the groups characterised in this study speak a Niger-Congo language, except for the Anuak in south-west Ethiopia who speak a Nilo-Saharan language. All buccal swabs were collected anonymously with appropriate ethical approval and informed consent. Sociological data were also collected from most individuals, including age, current residence, birthplace, self-declared cultural identity, first language, second language and (when available) religion of the individual, as well as similar information on the individual’s father, mother, paternal grandfather and maternal grandmother. The samples were classified into groups primarily by cultural identity, first language spoken and then by place of collection. Where collections from a particular group were made in more than one location, locations are represented by averages of geographic coordinates. DNA from Congolese samples was extracted using the Gentra protein precipitation method (Gentra Systems, Minneapolis, MN, USA). Previously collected buccal-swab DNA samples from ethnic groups across sub-Saharan Africa were extracted by the standard phenol-chloroform method. The range and mean of sample sizes of the 43 groups are 25–118 and 63, respectively.

Y-chromosome typing

A combination of UEPs and STRs in the paternally inherited NRY was typed in eight Congolese groups (n=591). The polymorphic markers are six STRs (DYS19, DYS388, DYS390, DYS391, DYS392 and DYS393) and four UEPs (M191, U175, U290 and U181) characterising the E1b1a haplogroup, which is modal in most population groups within the area of the EBSP.25 The four UEPs were typed using a tetra-primer ARMS PCR method37 with minor modifications. The outer and two inner fragments were amplified in a 10-μl reaction volume containing 1 μl (1 ng) of template DNA, 1.6 μl (50 μM) dNTPs, 9.3 nM TaqStart monoclonal antibody (BD Biosciences Clontech, Oxford, UK), 0.13 U of Taq polymerase (HT Biotech, Cambridge, UK) and outer and inner primers (see Supplementary Table S2 for primer details). All samples (96-well plates) were then placed on a thermocycler under the following conditions: denaturation at 95 °C for 5 min, followed by 35 cycles of denaturation (95 °C) for 45 s, annealing (see Supplementary Table S2 for annealing temperatures) for 45 s and elongation (72 °C) for 45 s. The final step of the PCR programme was a 7-min extension at 72 °C before a 30-min hold at 4 °C. Where samples were ancestral for the four UEP markers, a further six to eleven UEPs (UEP1 and UEP2 kits: sY81, SRY4064, YAP, SRY10831, M13, M9, SRY465, M20, Tat, 92R7 and M17) were typed.38 NRY haplogroups were classified according to the nomenclature of the Y-Chromosome Consortium39 (Figure 1) and STR repeat sizes were assigned according to the nomenclature of Kayser et al.40 Additionally, the four E1b1a-specific UEPs were typed in 1820 samples, previously characterised as E1b1a in the TCGA database (published35, 36 and unpublished data), from the 35 non-Congo, sub-Saharan groups listed in Supplementary Table S1.
Although the battery of the NRY markers typed in UEP kits gives a relatively crude resolution of NRY haplogroups, the typing of four UEP markers within E1b1a considerably increases the resolution of NRY types associated with EBSP.32

Figure 1

Genealogical relationships of UEP markers used to define NRY haplogroups. The box identifies the E1b1a clade, exclusively observed in population groups with recent African ancestry.

Statistical analysis

Haplotype diversity, h, and its SE were estimated from the unbiased formulae of Nei41 using Arlequin software version 3.0.42 The average squared difference (ASD) in STR allele size between all chromosomes and the presumed ancestral haplotype (assumed to be the modal haplotype), averaged over loci, was estimated using YTIME software,43 and the corresponding 95% confidence intervals were calculated as described in Thomas et al44 using the ‘R’ environment for statistical computing (www.R-project.org). The TMRCA was estimated using an average NRY STR mutation rate of 0.00245 and a generation time of 25 years. Fisher’s exact test was also performed in the R environment. The probability of observing a particular haplotype, if present, in a randomly collected set was assessed by the equation $(1-q)^{n} = 1-P$, where P is the probability of observing the haplotype, q is the minimum frequency of the haplotype to be observed and n is the number of chromosomes. According to the equation, the minimum frequency at which a haplotype must be present for it to have a 95% probability of being observed, given that n chromosomes are typed, is $q = 1 - 10^{\log_{10}(0.05)/n}$.
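The detection-threshold equation above can be sketched in a few lines of Python. This is only an illustration of the stated formula, not the authors' R code; the function names are hypothetical.

```python
def min_detectable_frequency(n, detection_prob=0.95):
    """Smallest haplotype frequency q that would be observed at least once
    among n typed chromosomes with probability detection_prob.

    Rearranges (1 - q)**n = 1 - P to q = 1 - (1 - P)**(1 / n).
    """
    if n <= 0:
        raise ValueError("n must be a positive number of chromosomes")
    if not 0.0 < detection_prob < 1.0:
        raise ValueError("detection_prob must lie strictly between 0 and 1")
    return 1.0 - (1.0 - detection_prob) ** (1.0 / n)


def detection_probability(q, n):
    """Probability P of seeing a haplotype of frequency q at least once in n chromosomes."""
    return 1.0 - (1.0 - q) ** n
```

With n=516 chromosomes, `min_detectable_frequency(516)` rounds to 0.0058, the threshold quoted in the Discussion for the West-Central African data set.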

Results

Of the possible 17 haplogroups, 12 were observed in the complete data set with haplogroup E1b1a modal (0.847, range in population groups 0.389–0.957), both overall and in every sub-Saharan African group. Only two other haplogroups exceeded 5% of the total: BT* (xDE,KT) (7.5%) and E* (xE1b1a) (5.1%). Table 1 reports the frequencies of all observed haplogroups, including the component haplogroups of E1b1a. Haplogroup E1b1a7 or E1b1a8* is modal in all groups with the exception of Bankim (Cameroon) and Fante (Ghana). The pooled frequencies of E1b1a component haplogroups, based on their geographic locations, are also shown in Figure 2.

Table 1 Frequency distribution of NRY haplogroups in 43 sub-Saharan African population groups
Figure 2

Visual representation of the distribution of E1b1a component haplogroups in sub-Saharan African groups with sample totals. Sectors in pie charts are coloured according to the haplogroup colour code to the left. Sample sizes are indicated within the pie charts. Samples in the Congolese data set have been divided into three pie charts representing Bantu H, B and C speakers.

With one exception, all haplogroups within E1b1a were observed in the Bantu Homeland, West-Central Africa, East Africa and Ghana: haplogroup E1b1a8a1a, although present in the Bantu Homeland and East Africa, was not observed in either Ghana or West-Central Africa.

Diversity (h) of E1b1a, calculated at the five component-haplogroup level, ranged from 0.379 to 0.753, excluding the Anuak (h=0). For comparison, the NRY haplotype diversity treating E1b1a as a single haplogroup ranged from 0.821 to 0.945, with the exception of the Anuak, who displayed a much lower diversity (h=0.516).
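For readers unfamiliar with the diversity statistic, the following minimal sketch computes Nei's unbiased estimator, h = n/(n−1) × (1 − Σp_i²), from haplogroup (or haplotype) counts. It assumes this standard form of the estimator, omits the SE, and does not reproduce Arlequin's implementation; the function name is hypothetical.

```python
def haplotype_diversity(counts):
    """Nei's unbiased diversity h from a list of per-haplotype counts.

    h = n / (n - 1) * (1 - sum(p_i ** 2)), where p_i = count_i / n.
    """
    n = sum(counts)
    if n < 2:
        raise ValueError("at least two chromosomes are needed")
    sum_sq = sum((c / n) ** 2 for c in counts)
    return n / (n - 1) * (1.0 - sum_sq)
```

A group in which every chromosome carries the same component haplogroup gives h=0, as reported for the Anuak at the component-haplogroup level.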

The EBSP six-STR haplotype was modal in 36 out of the 43 groups (see Supplementary Table S3) and was almost always a member of E1b1a8 (frequency of 96.4%, P<0.0001). Table 2 contains the six-STR haplotype gene diversities for E1b1a component haplogroups present in all three West, West-Central and East-Central regions.

Table 2 STR haplotype diversity within E1b1a component haplogroups present in all Bantu-speaking groups

The TMRCA for each haplogroup-defining UEP (with at least 20 chromosomes) is presented in Table 3 along with regions and countries within which each haplogroup was observed. The finer branches of the genealogical tree were associated with lower estimates of TMRCA (Figure 1). TMRCA for E1b1a as a whole was estimated at 6175–6588 YBP with the TMRCA for the youngest haplogroup (E1b1a8a1a) estimated at 1100–1638 YBP.
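The ASD-based dating described in the Statistical analysis section can be sketched as follows. This is a simplified illustration that assumes a strict stepwise mutation model, in which the expected ASD from the ancestral haplotype equals the mutation rate multiplied by the number of generations. It is not the YTIME implementation and does not compute confidence intervals; the function names are hypothetical, and the mutation rate is supplied by the caller.

```python
from collections import Counter


def modal_haplotype(haplotypes):
    """Most frequent STR haplotype, used here as the presumed ancestral haplotype."""
    return Counter(tuple(h) for h in haplotypes).most_common(1)[0][0]


def average_squared_difference(haplotypes, ancestral):
    """Mean squared difference in repeat number from the ancestral haplotype,
    averaged over chromosomes and over loci."""
    n_loci = len(ancestral)
    total = 0.0
    for h in haplotypes:
        total += sum((a - b) ** 2 for a, b in zip(h, ancestral))
    return total / (len(haplotypes) * n_loci)


def tmrca_years(haplotypes, mutation_rate, generation_years=25, ancestral=None):
    """TMRCA in years: ASD / mutation_rate gives generations, scaled by generation time."""
    if ancestral is None:
        ancestral = modal_haplotype(haplotypes)
    asd = average_squared_difference(haplotypes, ancestral)
    return asd / mutation_rate * generation_years
```

Each haplotype is a sequence of repeat counts in a fixed locus order (for example DYS19, DYS388, DYS390, DYS391, DYS392, DYS393). Passing the rate as an argument keeps the sketch independent of any particular rate estimate.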

Table 3 Estimated TMRCA of E1b1a UEP dates and distribution of E1b1a component haplogroups in sub-Saharan Africa

Discussion

We analyse frequencies of haplogroups and estimates of TMRCA to answer two questions: (a) Is there evidence of more than one ‘expansion’ of paternal line ancestors of Bantu-speaking people living in present-day sub-Saharan Africa? and (b) If so, did those ‘expansions’ take different routes? We define ‘expansion’ in this context to mean diffusion of alleles. In doing so, we assume (a) that the NRY has a genealogy that, at least in that part of the genealogical tree analysed in this paper, can be unambiguously constructed using UEPs47 (Figure 2) and (b) that ASD is a measure of STR diversity that increases linearly over time and that calculating ASD from the common ancestor of a random sample of NRY that are members of a haplogroup provides an estimate of the TMRCA.43 Consistent with previous studies, we observed a high-frequency modal six-STR NRY haplotype (DYS19, 388, 390, 391, 392, 393:15–12–21–10–11–13) throughout the area of the EBSP.26, 35, 36 Interpreting the frequencies of the component haplogroups of E1b1a within the context of their geographic distribution and TMRCA values throws additional light on the expansions associated with the EBSP. These data are consistent with multiple expansion events southwards from West Africa. Haplogroup E1b1a7 (defined by M191) is modal in most groups in countries from Ghana to Mozambique and only at slightly lower frequency in South African Bantu speakers (33.8% compared with E1b1a8* at 37.8%). The TMRCA at 4700–5300 YBP is entirely consistent with the haplogroup being present in West Africa at the dawn of the EBSP. It is likely to have expanded south as the demographic events comprising the EBSP took place. The haplogroup E1b1a8, defined by U175, has a TMRCA of only 1863–2163 YBP but a geographic distribution that, excepting the Anuak of Ethiopia, is as extensive as that of E1b1a7.
If it is assumed that an earlier expansion had already taken place, this would be consistent with a subsequent, rapid expansion from West Africa southwards along both the western and eastern routes. This is consistent with the analysis of de Filippo et al,31 which is also supportive of a rapid expansion. It is interesting to speculate on the possibility that this later expansion was associated with the contemporaneous development of metallurgy.

The distribution of haplogroup E1b1a8a1*, defined by U290 in the absence of U181, with a TMRCA of 1413–1725 YBP, is similar to that of E1b1a8 and may be interpreted in the same way. The distribution of haplogroup E1b1a8a1a (defined by U181) with a very recent TMRCA of only 1100–1638 YBP is very different, however, being restricted to Nigeria and the east side of sub-Saharan Africa (Figure 2). As a consequence it is consistent with a late, rapid expansion from south of the Grassfields of Cameroon that did not include expansion along the earlier western route. Because the West-Central African E1b1a data set is sufficiently large (n=516; eight groups), we would have expected to observe the E1b1a8a1a haplotype, if present at a frequency as low as 0.0058. Therefore, it is unlikely that the absence of this haplogroup is due to drift after the initial stage of expansion, when only a small number of individuals may have been involved, or that it is present but was simply not observed in the present study. We note that the phenomenon of surfing can explain the absence of an allele in only some groups that are the consequence of range expansion.48, 49 However, unless the allele (in this case NRY belonging to haplogroup E1b1a8a1a) became extinct early in the western route expansion (which is, in effect, the same as not having been part of that expansion), there is no reason to suppose that extinction of the haplogroup in western route groups (Guthrie classification H, B and C) was more likely than in eastern groups (Guthrie classification N and P). See Supplementary Table S4 for Guthrie classifications of all Bantu-speaking groups included in the analysis.
In this study, haplogroup E1b1a8a1a, the haplogroup with the shortest TMRCA, was observed in all eastern data sets (three from Malawi, one from Mozambique (in both cases, all speakers of Guthrie classification Bantu languages N and P spoken on the eastern side of Africa) and one from Pretoria, n (samples)=18) but in none of the eight western groups (all speakers of Guthrie classification Bantu languages H, B and C spoken on the western side of Africa) (Fisher’s exact test: haplogroup present/absent in data set P=0.0008; haplogroup frequency P<0.0001). Comparisons made without including data sets from South Africa and Mozambique, so as to exclude the possibility of admixture between western and eastern Bantu-speaking expansions in the southern extremity of the continent, remain significant for both presence/absence of E1b1a8a1a in data sets and for frequency of the haplogroup (P<0.01). The genetic data are thus in broad agreement with analysis based on linguistic studies, which suggests that the spread of Bantu languages is the consequence of successive dispersals and that a single large-scale migration by Bantu speakers is unlikely.3 It is also consistent with suggestions that differences between eastern and western Bantu languages are a consequence of expansion patterns.3 This interpretation suggests the absence of substantial male-mediated gene flow from East-Central Africa to West-Central Africa during the past millennium, because had it occurred, it would be expected that examples of haplogroup E1b1a8a1a would have been observed in the Congolese groups included in this study. Further support for the EBSP origin from the Nigeria/Cameroon area comes from the observation that E1b1a component-haplogroup STR diversities are greater in West Africa than in either West-Central or East-Central Africa (Table 2).
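The presence/absence comparison can be illustrated with a standard-library Python sketch of a two-sided Fisher's exact test. The authors ran the test in R; the 2×2 table used below (five eastern data sets with the haplogroup present, eight western groups with it absent) is our reading of the text rather than a table given in the paper, and the function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same margins
    that are no more probable than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))


# Rows: eastern, western data sets; columns: E1b1a8a1a present, absent.
p_value = fisher_exact_two_sided(5, 0, 0, 8)
```

For this table only one arrangement out of C(13,5)=1287 is as extreme as the observed one, so `p_value` is 1/1287, which rounds to the reported P=0.0008.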

Recently, Alves et al,33 analysing a battery of 14 DIPSTRs (ie, deletion/insertion polymorphisms tightly linked to STRs) in 19 Bantu-speaking groups from Mozambique and Angola, concluded that it is becoming increasingly difficult to accept models suggesting an early split between eastern and western Bantu-speaking populations, whereas Montano et al,34 analysing NRY UEPs and STRs in groups from Nigeria, Cameroon, Gabon and Congo, concluded that the evolutionary scenario is more complex than previously thought. Our analysis of NRY from groups over a wide geographic area is consistent with both these conclusions.

We conclude that analysis of NRY in 43 widely distributed population groups from across sub-Saharan Africa provides evidence of multiple expansions from West Africa along the western and eastern routes and of a late, specifically eastern expansion at some time during the past two millennia. During that period, male-mediated gene flow from East-Central to West-Central Africa does not appear to have taken place, at least to any significant extent. Future studies that examine variation in the NRY E1b1a clade in Bantu-speaking population groups representing the East African coast will help to further elucidate the late eastern EBSP.