Introduction

Human population genetics has recently come full circle. It started with the so-called classical polymorphisms (ie, blood groups and other protein polymorphisms), which were analyzed on the basis of their allele frequencies with potent statistical instruments such as principal component analysis (PCA). This era culminated with the landmark publication of the magnificent book by Cavalli-Sforza et al1; however, a few shortcomings of classical polymorphisms can be pointed out: relatively few loci were used; their relationship to the underlying genetic variation was mostly unknown; and they could have been confounded by natural selection. PCR and automated sequencing heralded the uniparental marker era: mtDNA and the non-recombining region of the Y-chromosome could be routinely analyzed, and a firm phylogeography could be established for both genomic regions, allowing the dissection of population structure with unprecedented precision and reliability. Yet they behave as just two loci, for which neither natural selection nor the peculiarities associated with their sex-specific transmission can be ruled out. Another technological development spearheaded a new breakthrough in human population genetics: single nucleotide polymorphism (SNP) array genotyping platforms have made it affordable to genotype hundreds of thousands of markers. The results are again treated in terms of allele frequencies and subjected to PCA or to newer techniques, such as Bayesian classification algorithms. Now, the whole genome is covered, and the action of selection is masked by a vast majority of putatively neutral markers.

The genetics of African populations, of paramount interest given the recent African origin of humankind, has been through the full cycle of studies. Cavalli-Sforza et al1 identified a north–south gradient in the continent that could be attributed to the Bantu expansion, whereas other principal components had a less clear interpretation. African mtDNA phylogeography was firmly established by Salas et al2, who described the structure of maternal lineages in the continent and identified some haplogroups involved in human expansions such as the Bantu expansion. Deep analyses of maternal lineages in African hunter-gatherers (Khoisan speakers and Pygmies) have revealed a clearly structured phylogeny for the mtDNA.3, 4, 5 A number of papers have approached both the general and the more local aspects of the non-recombining part of the Y-chromosome in Africa,6, 7 although its phylogeographic structure has not been as refined as that of its maternal counterpart. More recently, in a landmark paper by Tishkoff et al8, 1327 nuclear microsatellite markers were analyzed in 121 African populations, identifying a number of layers in the African population structure that could be related to history, language, and geography. The continent, south of the Sahara, seems to be dominated by a component mostly correlated with Niger-Congo-speaking populations, whereas other components are found in the Sahel, among Nilo-Saharan speakers, and in Afro-Asiatic speakers in the north and northeast. Among the hunter-gatherer populations, the Khoisan-speaking Hadza of Tanzania were clearly distinct, whereas Pygmies could not be discriminated from the South African Khoisan. At a higher discrimination level, western Pygmies became distinct, but the eastern Pygmies remained similar to the Khoisan. In another recent publication, Bryc et al9 analyzed SNP data obtained from West Africans (and African Americans), revealing a structure reflecting primarily language and secondarily geographical distances. Unfortunately, this work is restricted mainly to populations in Central West Africa, around the Gulf of Guinea. Both of these recent studies have greatly advanced our understanding of the genetic structure of sub-Saharan Africa. However, in Tishkoff et al and Bryc et al, as well as in previous works, the area between Central and South Africa remains undersampled. Clearly, southeast Africa is a key geographical zone for understanding the Bantu expansion routes, making it a region of particular interest for the population history of sub-Saharan Africa.

Data

In a case-control study for placental malaria, we obtained, with appropriate informed consent, 180 cases and 180 controls from Mozambique, in southeast Africa. These samples were genotyped with the Affymetrix GeneChip Human Immune and Inflammation 9K SNP Kit (Santa Clara, CA, USA), resulting in a total of 279 samples with reliable data after stringent quality control. Other African samples with genome-wide SNP data available are the Biaka Pygmies, Mbuti Pygmies, Mandenka, Yoruba, San, and Bantu speakers from the HGDP panel10, and the Maasai, Luhya, Yoruba, and African Americans from HapMap Phase 3 (http://hapmap.ncbi.nlm.nih.gov/). The intersection of all arrays provides a common set of 2841 SNPs with genotype data for all populations (see Supplementary Information for details).
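The intersection step can be illustrated with a minimal sketch. It assumes each panel is represented simply as a list of SNP identifiers; the panel names and rsIDs below are hypothetical placeholders, not data from this study, and a real merge would also have to reconcile strand and allele coding, which is not shown here.

```python
from functools import reduce


def common_snps(panels):
    """Return the sorted SNP identifiers that are present in every panel."""
    id_sets = [set(ids) for ids in panels.values()]
    return sorted(reduce(set.intersection, id_sets))


# Hypothetical panels with made-up identifiers, for illustration only.
panels = {
    "immune_9k": ["rs0001", "rs0002", "rs0003", "rs0004"],
    "hgdp": ["rs0002", "rs0003", "rs0004", "rs0005"],
    "hapmap3": ["rs0003", "rs0004", "rs0006"],
}

shared = common_snps(panels)
print(len(shared), "SNPs shared by all panels:", shared)
```

Only markers genotyped on every platform survive, which is why the common set is so much smaller than any single array.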

Results

First, we wanted to test whether the number of SNPs available (2841 SNPs) provides enough genetic resolution to detect any structure in African populations and to provide a reference for the number of SNPs needed in population studies. To that end, we combined the global Human Genome Diversity Panel (HGDP) and HapMap Phase 3 genotype data (∼460 000 SNPs) and subjected them to PCA (see Supplementary Information for details). Results are similar to those obtained with the HGDP samples, with the first and second PC (Figure 1a) separating East Asia (upper left corner of the plot) from Europe (bottom centre) and sub-Saharan Africa (upper right).10, 11 The same structure is recovered when random subsamples of 100 000 (Figure 1b), 10 000 (Figure 1c), and 1000 (Figure 1d) SNPs are considered, although inter-individual variation increases. A random set of 2841 SNPs from this pooled HGDP-HapMap dataset (Figure 1e) performs similarly to the set of 2841 SNPs related to immunity and inflammation (Figure 1f), despite the slightly reduced interpopulation differentiation of the latter, which is expected as they are gene-based SNPs.12 We can conclude that the common set of 2841 SNPs genotyped is an appropriate tool to study population structure in African populations; in general, worldwide patterns are evident and robust when using a minimum of 1000 SNPs.
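The subsampling experiment described above can be sketched on simulated data. The following is an illustrative toy example, not the pipeline used in this study: genotypes are drawn under a simple Balding-Nichols-style model with an assumed differentiation parameter (`fst=0.1`), PCA is computed by singular value decomposition of the per-SNP standardized genotype matrix, and the function names and the `separation` statistic (between-population over within-population variance of the PC scores) are our own hypothetical choices.

```python
import numpy as np


def simulate_genotypes(n_per_pop, n_snps, n_pops=3, fst=0.1, seed=None):
    """Toy diploid genotypes (0/1/2) for n_pops diverged populations."""
    rng = np.random.default_rng(seed)
    ancestral = rng.uniform(0.1, 0.9, size=n_snps)
    # Beta parameters giving population frequencies scattered around the
    # ancestral frequency, with more scatter for larger fst.
    a = ancestral * (1 - fst) / fst
    b = (1 - ancestral) * (1 - fst) / fst
    blocks, labels = [], []
    for pop in range(n_pops):
        freqs = rng.beta(a, b)
        blocks.append(rng.binomial(2, freqs, size=(n_per_pop, n_snps)))
        labels += [pop] * n_per_pop
    return np.vstack(blocks), np.array(labels)


def pca_scores(genotypes, n_components=2):
    """Principal component scores of individuals (rows) via SVD."""
    g = genotypes.astype(float)
    g -= g.mean(axis=0)               # centre each SNP
    sd = g.std(axis=0)
    g = g[:, sd > 0] / sd[sd > 0]     # drop monomorphic SNPs, then scale
    u, s, _ = np.linalg.svd(g, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


def separation(scores, labels):
    """Between-population over within-population sum of squares."""
    grand = scores.mean(axis=0)
    between = within = 0.0
    for pop in np.unique(labels):
        block = scores[labels == pop]
        centre = block.mean(axis=0)
        between += len(block) * np.sum((centre - grand) ** 2)
        within += np.sum((block - centre) ** 2)
    return between / within


genotypes, labels = simulate_genotypes(50, 20000, seed=0)
rng = np.random.default_rng(1)
results = {}
for n in (20000, 2841, 1000):
    # Random SNP subset of size n, drawn without replacement.
    cols = rng.choice(genotypes.shape[1], size=n, replace=False)
    results[n] = separation(pca_scores(genotypes[:, cols]), labels)
    print(n, "SNPs: separation =", round(results[n], 1))
```

In such a simulation the populations remain separated on the first two components as the marker count drops, while the within-population scatter grows relative to the distance between clusters, which mirrors the increase in inter-individual variation noted in the text.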

Figure 1

PCA of merged HGDP and HapMap 3 samples. Panels show the results of the PCA for the full merged set of SNPs (460 147 SNPs) (a), for random subsets of 100 000 (b), 10 000 (c), 1000 (d), and 2841 SNPs (e), as well as for the 2841 SNPs in the merged analysis set including the samples from Mozambique (f). As can be seen, the general pattern of differentiation is reproduced even when using only a random subset of 1000 SNPs. Different colors indicate the continental region of the respective populations. Abbreviations: AME, Americas; CSASIA, Central and South Asia; EASIA, East Asia; EUR, Europe; MENA, Middle East and North Africa; OCE, Oceania; SSAFR, sub-Saharan Africa.

Next, we applied PCA13 and STRUCTURE14, 15 to 775 individuals in 11 populations of sub-Saharan African descent. The first PC (Figure 2a) and STRUCTURE with K=2 (Figure 3) separate the Nilo-Saharan-speaking Maasai from all other populations, with the neighboring Luhya and the African Americans in an intermediate position. Both the second PC and K=3 separate the hunter-gatherer samples (the presumably ancestral Pygmy and San populations) from the rest. The third PC allows us to discriminate between western/central (Mandenka, Yoruba), eastern (Maasai, Luhya), and southeastern populations (Mozambique), irrespective of language family. This is the PC that is most strongly correlated with geography (Figure 2c). The fact that it is the third rather than the first component, as would be expected if isolation by distance were the predominant force shaping genetic diversity,16 implies that directional population movements (such as the Bantu expansion) and barriers to gene flow (such as that between food producers and hunter-gatherers) are more relevant than geographic distance for understanding the genetic landscape of sub-Saharan Africa. The distinction between west and southeast Africa is also shown with K=4; at K=5, the Niger-Congo-speaking Luhya are separated from the rest. The new component that appears at K=6 is restricted to African Americans and Biaka Pygmies, and is the last component that can be attributed to specific populations.

Figure 2

PCA of sub-Saharan African populations. Panels show plots of the first three principal components obtained from the 11 sub-Saharan African populations. (a) First and second components. (b) First and third components. (c) Biplot of rotated PC1 and PC3 superimposed onto a map of Africa. Geographical locations of the populations are indicated by their names and their respective enlarged plot symbols. Different colors indicate linguistic or cultural group for the respective populations (green: Nilo-Saharan; blue: Niger-Congo; orange: Hunter Gatherer; grey: admixed).

Figure 3

STRUCTURE results for sub-Saharan African populations. Depicted are the results of five runs each for the number of clusters ranging from K=2 to K=7, combined using CLUMPP.

Discussion

The preceding results agree with what was found previously by Tishkoff et al using microsatellites, and go beyond them with new findings and refinements of previous genetic studies:

  i) The main distinction is between Niger-Congo groups and the rest, including Nilo-Saharan speakers and hunter-gatherers (the Khoisan, but not the Pygmies, having preserved their ancestral language). Among Niger-Congo populations, geography is the main factor explaining the genetic differences, with a remarkable similarity among western populations (Yoruba and Mandenka), which could reflect a burst in the expansion to the west, related to iron technology and Niger-Congo languages.

  ii) The southeastern Bantu from Mozambique are remarkably differentiated from the western Niger-Congo-speaking populations, such as the Mandenka and the Yoruba, and are also differentiated from geographically closer Eastern Bantu samples, such as the Luhya. These results suggest that the Bantu expansion of languages, which started ∼5000 years ago at the present-day border region of Nigeria and Cameroon, and was probably related to the spread of agriculture and the emergence of iron technology,17, 18, 19 was not a demographically homogeneous migration with population replacement in the southernmost part of the continent, but acquired more divergence, likely because of the integration of pre-Bantu people. The complexity of the expansion of Bantu languages to the south (with an eastern and a western route20) might have produced differential degrees of assimilation of previous populations of hunter-gatherers. This assimilation has been detected with uniparental markers through the genetic comparison of present-day hunter-gatherers (Pygmies and Khoisan) with Bantu-speaking agriculturalists.2, 21, 22, 23, 24 Nonetheless, the singularity of the southeastern population of Mozambique (poorly related to the present Khoisan) could be attributed to a complete assimilation of ancient, genetically differentiated populations (presently unknown) by Bantu speakers in southeastern Africa, without leaving any pre-Bantu population in the area for comparison.

  iii) The difference between hunter-gatherers and the rest of the sub-Saharan populations is important, but it is not the main feature of African genetics. Noteworthy is the strong similarity among the three hunter-gatherer populations studied, with no specific Pygmy component, but with an important Bantu introgression (as seen at K=3) in Biaka Pygmies. Pygmies should be included along with the Khoisan in the search for deep-rooted African and human lineages. Moreover, the specific component that identifies the three hunter-gatherer populations is found in small amounts in all other African populations, possibly as a result of introgression from previous settlers of most of the African territory.

As a more general observation, we found that as few as 1000 genome-wide SNPs are enough to robustly recover the patterns of genetic structure among worldwide populations. It has to be noted that even though this low number seems to be sufficient for inferring major demographic events and broad population structure, it remains doubtful whether it will be sufficient for more fine-scale inference, for example within a genetically uniform region such as Europe. Nevertheless, the high level of genetic structure in sub-Saharan Africa allows us to be confident in our conclusions. Furthermore, the fact that our dataset of 2841 SNPs has only limited fine-scale resolution makes the observed strong differentiation of the population from Mozambique even more striking. The genetic analysis of a large number of SNPs thus provides a robust tool to refine our understanding of past population history.