Introduction

Genetic data, particularly from non-recombining regions of the genome, have always had a major role in the understanding of the past and present of the human species. In this context, the male-specific region of the Y chromosome, which is the largest non-recombining region in the human genome, has become a valuable tool for the study of historical and pre-historical human movements, among other anthropological questions.1

Because Y-SNP haplogroups are stable and show geographically specific distributions, human migrations can be inferred from where they occur. However, in Africa, the study of the major Y chromosome haplogroups has not been very detailed and has been essentially restricted to the principal clades observed in this continent, namely the non-monophyletic lineages of clade A and haplogroups B and E. Although haplogroup E is widespread across sub-Saharan Africa, lineages from A and B, the most basal branches in the Y chromosome tree, are usually restricted to specific populations, especially pastoralists and hunter-gatherers.2, 3, 4

Apart from the aforementioned haplogroups, some studies have reported unexpectedly high frequencies of haplogroup R1b1-P25 in some African populations.5, 6, 7 This haplogroup is thought to have originated in Asia, and its high frequencies in Central-West African countries have been explained through a migration back to Africa in prehistoric times, mediated in Africa by speakers of the Chadic family of Afro-Asiatic languages.6, 7 The arrival of this ethnic group to Lake Chad from the Proto-Afro-Asiatic homeland in Eastern Africa has been explained by two different hypotheses. Blench’s theory (the ‘inter-Saharan’ hypothesis)8 suggests that Chadic speakers arrived at the Chad Basin through an east to west migration through the Sahel, whereas Ehret’s theory (the ‘Trans-Saharan’ hypothesis)9 suggests that they arrived from the north through a migration across the Sahara desert. The latter hypothesis has been used as the explanation for the high frequencies of the R1b1-P25 haplogroup in Central-West Africa, mainly due to its presence in speakers of other Afro-Asiatic languages in North Africa.7, 10 Nevertheless, there is still an on-going debate about the most suitable explanation for these observations.10, 11

To obtain a better and deeper understanding about this migration, a more detailed study of populations in all of the regions across the proposed paths is necessary. Equatorial Guinea is a Central-West African country located below the Chad Basin that could also have been influenced by this migration. The Y-SNP haplogroups in this country are still uncharacterised, although it borders Gabon and Cameroon, where the highest frequencies of the R1b1 haplogroup in Africa have been found.7

The main objective of the present work is to better characterise the male lineages from Equatorial Guinea, particularly lineages of the Y chromosome haplogroup R1b1, to investigate the Chadic migration route hypothesis. The comparison of our results with previously published data will hopefully contribute not only to a better characterisation of the still understudied African genetic diversity but also to a better understanding of the unexpectedly high frequencies of the R1b1 haplogroup in this region of Africa.

Materials and methods

Population sample and DNA extraction

A total of 112 unrelated males from Equatorial Guinea (living in Madrid, Spain, at the time of sample collection) were analysed for 17 STRs and 49 biallelic polymorphisms (single-nucleotide polymorphisms (SNPs)) in the male-specific region of the Y chromosome.

Sample collection was performed in 2001, from adult males who were born and had lived in Equatorial Guinea before moving to Spain for work, keeping their Guinean nationality. The great majority of the population in Equatorial Guinea is of Bantu origin (the largest tribe, Fang, represents 85.7% of the population), according to the 1994 census.12 This strong Bantu influence is also noted in the language; apart from the more recent official European languages (Spanish and French) and two creole languages (admixture with Indo-European), all of the others belong to the Narrow Bantu linguistic group.13 Although no information was available concerning the ethnic background of each male, it is expected that they mostly represent a mixture of Fang and Bubi groups from different regions in Equatorial Guinea.

DNA was extracted using a standard Chelex-100 (Bio-Rad, Hercules, CA, USA) method.14

Y-STR typing

The AmpFℓSTR Yfiler PCR Amplification Kit (Applied Biosystems, Foster City, CA, USA) was used to analyse 17 STR markers located in the male-specific region of the Y chromosome. An ABI 3130 Genetic Analyser (Applied Biosystems) was used for Y-STR typing, and the results were analysed using GeneMapper ID software v3.2 (Applied Biosystems). The Y-STR alleles were designated according to the ISFG recommendations.15 Several additional Y-STRs had already been studied in a subset of these samples using the PowerPlex Y System (Promega Corporation, Madison, WI, USA), following the protocol described by Arroyo-Pardo and collaborators.16

Y-SNP typing

The analysis of 49 Y-SNPs allowed the identification of 49 different Y chromosome haplogroups (Figure 1). The SNP markers used in this study were selected based on the Y chromosome parsimony tree17 to characterise the haplogroups most frequently found in sub-Saharan Africa as well as to identify those that are thought to have been brought by the Europeans to Africa during the last five centuries.3, 5, 18, 19, 20, 21, 22 Moreover, eight additional SNPs inside haplogroup R1b were also included in our set to increase the resolution of this haplogroup.7

Figure 1

Phylogenetic tree of the Y chromosome haplogroups studied. Biallelic markers are displayed in each branch. Haplogroups are named according to Karafet et al.17

A hierarchical approach (Supplementary Figure S1) based on the phylogeny reported by Karafet et al17 was used, eliminating the need to type all SNPs to define each haplogroup. This method was performed using single and multiplex PCR, RFLP, SNaPshot and direct DNA sequencing analysis. The Y Alu polymorphic element (YAP), Y-SNPs M41, M54 and M75, and three multiplexes (Multiplex 1 plus M13, Multiplex E and Multiplex B) were analysed as described by Gomes et al.3 Based on the results from Multiplex 1 plus M13, a new set of Y-SNPs was typed: (i) if only the markers SRY10831.1 and M213 carried the derived allele, Multiplex 2 (described by Brión et al23) was performed under the same conditions used for the other three multiplexes; (ii) in chromosomes carrying the derived allele at marker P25, M269 was typed using the RFLP technique. The amplification PCR for this marker was conducted using the primers described by Beleza et al,18 and the conditions were optimised from Gomes et al3 for the multiplexes used. Touchdown PCR was performed with a first set of 5 cycles with a higher annealing temperature of 65 °C for 90 s and a second set of 30 cycles in which the temperature was decreased to 62 °C, also for 90 s. After confirming the amplification in a polyacrylamide gel using the silver staining method, 2 μl of the amplified product was incubated overnight at 37 °C with 0.05 μl of the restriction enzyme MvaI at a concentration of 10 U/μl, 0.5 μl of 10 × buffer R+ and 2.45 μl of water. As the restriction site for MvaI is present only when the SNP is in the derived state, the two alleles can be discriminated in a silver-stained polyacrylamide gel.
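As a quick check on the digestion mix described above, the following minimal Python sketch sums the stated volumes and derives the final buffer strength and the enzyme units per reaction. The dictionary keys and function name are illustrative only; the volumes and the 10 U/μl stock concentration are those given in the text.

```python
# Volumes (in microlitres) of the MvaI digestion mix, as stated in the text.
DIGEST_MIX_UL = {
    "amplified product": 2.0,
    "MvaI (10 U/ul stock)": 0.05,
    "10x buffer R+": 0.5,
    "water": 2.45,
}


def digest_summary(mix=DIGEST_MIX_UL, enzyme_stock_u_per_ul=10.0, buffer_stock_x=10.0):
    """Return total volume, final buffer strength and enzyme units per reaction."""
    total_ul = sum(mix.values())
    return {
        "total_ul": total_ul,
        # Dilution of the 10x buffer stock into the full reaction volume.
        "buffer_final_x": buffer_stock_x * mix["10x buffer R+"] / total_ul,
        # Units of enzyme = stock concentration x volume pipetted.
        "enzyme_units": enzyme_stock_u_per_ul * mix["MvaI (10 U/ul stock)"],
    }


summary = digest_summary()
print({key: round(value, 3) for key, value in summary.items()})
```

With these figures the reaction comes to 5 μl, the buffer is at 1× and each reaction contains 0.5 U of enzyme.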

The eight remaining SNPs in the set—M18, M335, M343, P297, V7, V8, V69 and V88—were typed to improve the resolution of the R1b haplogroup. This was performed through direct DNA sequencing in samples carrying the derived allele at marker P25 and the ancestral allele at M269. For all but one of the markers, new primers were designed for this analysis (Supplementary Table S1). The PCR amplification conditions were the same as those used for the aforementioned multiplexes, except in the case of the V88 marker. This marker is located in a region of the Y chromosome that has significant homology to the X chromosome. Therefore, to increase the specificity for the Y chromosome, the annealing temperature used during the amplification PCR was increased to 63°C. The PCR product purification, sequencing analysis and extended product purification were performed according to Gomes et al.3 The products of the sequencing reaction were run on an ABI PRISM 3130xl Genetic Analyser (Applied Biosystems) and were analysed using DNA Sequencing Analysis Software v5.2 (Applied Biosystems).
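The follow-up rules of the hierarchical typing strategy can be summarised in a short Python sketch. It encodes only the decisions stated in the text (Multiplex 2 when SRY10831.1 and M213 alone are derived; M269 RFLP when P25 is derived; sequencing of the eight R1b SNPs when P25 is derived and M269 is ancestral). The function name, the dictionary representation of allele calls and the step labels are assumptions made for illustration, not part of the published protocol.

```python
# SNPs sequenced to refine haplogroup R1b, as listed in the text.
R1B_SEQUENCING_SNPS = ("M18", "M335", "M343", "P297", "V7", "V8", "V69", "V88")


def follow_up_assays(calls):
    """Suggest the next typing steps for one chromosome.

    `calls` maps each marker typed so far to "derived" or "ancestral"
    (a hypothetical representation chosen for this sketch).
    """
    derived = {marker for marker, state in calls.items() if state == "derived"}
    steps = []
    # Rule (i): only SRY10831.1 and M213 carry the derived allele.
    if derived == {"SRY10831.1", "M213"}:
        steps.append("Multiplex 2")
    # Rule (ii): P25 derived, so M269 is typed by RFLP with MvaI.
    if "P25" in derived:
        if "M269" not in calls:
            steps.append("M269 RFLP (MvaI)")
        elif calls["M269"] == "ancestral":
            # P25 derived and M269 ancestral: sequence the remaining R1b SNPs.
            steps.extend(
                f"sequence {snp}" for snp in R1B_SEQUENCING_SNPS if snp not in calls
            )
    return steps


example = {"SRY10831.1": "derived", "M213": "derived", "P25": "derived"}
print(follow_up_assays(example))
```

A chromosome that is derived at both P25 and M269 returns no further steps in this sketch, because the text describes sequencing only for the M269-ancestral case.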

Data analysis

For the analysis of both Y-STRs and Y-SNPs, diversity values and pairwise genetic distances (FST) were calculated with the software Arlequin (version 3.5.1.2).24 Y-SNP haplogroup frequencies were determined by direct counting.
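For readers who want to see what these quantities involve, the sketch below computes frequencies by direct counting and an unbiased gene diversity of the form n/(n−1) × (1 − Σp²). This is assumed to correspond to the haplotype and haplogroup diversity reported by Arlequin; the function names and the toy sample are illustrative and are not data from this study.

```python
from collections import Counter


def frequencies(labels):
    """Relative frequency of each haplogroup (or haplotype) by direct counting."""
    n = len(labels)
    return {label: count / n for label, count in Counter(labels).items()}


def gene_diversity(labels):
    """Unbiased gene diversity: n/(n-1) * (1 - sum of squared frequencies)."""
    n = len(labels)
    if n < 2:
        raise ValueError("at least two chromosomes are needed")
    sum_squared = sum(p ** 2 for p in frequencies(labels).values())
    return n / (n - 1) * (1.0 - sum_squared)


# Toy sample (invented for illustration): three chromosomes in one
# haplogroup and one chromosome in another.
toy = ["E1b1a", "E1b1a", "E1b1a", "R1b1"]
print(frequencies(toy))
print(gene_diversity(toy))
```

The same function applies to Y-STR haplotypes if each haplotype is passed as a hashable label, for example a tuple of allele sizes.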

Pairwise genetic distances were visualised in a two-dimensional graphic through the multidimensional scaling (MDS) method implemented in the software STATISTICA 7.0.25

Phylogenetic networks were constructed to investigate the genetic relationships within haplogroup R1b1-P25 using the software Network 4.6.0.0 (http://www.fluxus-engineering.com/sharenet.htm) with sequential application of the reduced median26 and the median-joining27 methods to resolve extensive reticulation. Differential microsatellite weighting (inversely proportional to variance) was applied to obtain the most parsimonious network in accordance with Qamar et al.28
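One way to derive locus weights that are inversely proportional to variance is sketched below. The scaling to integer weights between 1 and a maximum of 10, and the treatment of monomorphic loci, are assumptions made for this example; the exact weighting scheme follows Qamar et al and is not reproduced here.

```python
from statistics import variance


def inverse_variance_weights(haplotypes, max_weight=10):
    """Integer weight per locus, inversely proportional to allele-size variance.

    `haplotypes` is a list of equal-length tuples of repeat counts.
    The least variable polymorphic locus receives `max_weight` (an assumed scale).
    """
    columns = list(zip(*haplotypes))
    variances = [variance(column) for column in columns]
    positive = [v for v in variances if v > 0]
    if not positive:
        return [max_weight] * len(columns)
    smallest = min(positive)
    weights = []
    for v in variances:
        if v == 0:
            # Monomorphic locus: no observed variation, so the weight has no effect.
            weights.append(max_weight)
        else:
            weights.append(max(1, round(max_weight * smallest / v)))
    return weights


# Invented two-locus haplotypes, for illustration only.
toy_haplotypes = [(14, 10), (14, 12), (15, 8), (15, 10)]
print(inverse_variance_weights(toy_haplotypes))
```

In this toy example the second locus is eight times more variable than the first, so it receives the minimum weight.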

Levene’s test was used to assess the homogeneity of variances in IBM SPSS Statistics, version 19.

Results and Discussion

Characterisation of the male lineages of Equatorial Guinea

In the 112 samples analysed, we were able to identify 104 different haplotypes and 13 different haplogroups (Figure 1 and Supplementary Table S2). The majority of the Y-SNP lineages found in this study (almost 80%) belong to haplogroup E, namely bearing the M2-derived allele, the most common haplogroup in sub-Saharan Africa Bantu populations. Apart from this haplogroup and five other chromosomes, all the remaining samples belong to haplogroup R, namely R1b1-P25, a lineage that is rare in Africa and is found mainly in Europe and Asia. Nevertheless, high frequencies of these lineages in some African populations have been previously reported by several authors.5, 6, 7, 29, 30

Lineages in clade A, although almost entirely restricted to Africa, have been described in Bantu populations at low frequencies. These lineages are mostly present in Nilo-Saharan speakers (for example, references 2, 5, 6, 21) and therefore, a high frequency was not expected in our sample. This expectation was confirmed; only one of the chromosomes in our sample belongs to this haplogroup, more specifically to the haplogroup A3b2-M13, which is more frequently observed among Nilotes than other African groups.3 Similarly, only one chromosome in our sample was found to belong to haplogroup B, namely to the branch that is most common in Bantu individuals (defined by the derived allele at M150).2, 5, 18, 21

The major haplogroup E, the most diverse clade in the Y chromosome tree, is widespread across the African continent, where its highest frequencies are found, and is also present in the Middle East, southern Europe, and Central and South Asia.17 This clade was the most frequent in our sample, representing almost 80% of all the chromosomes, similar to results for other sub-Saharan populations (for example, references 5, 30). The most frequent sub-lineage found in our sample inside this clade was the haplogroup E1b1a-M2, proposed as a marker of Bantu expansion.22, 30 Nevertheless, some other typically non-Bantu lineages were also found in our sample. One is the E1a-M33 haplogroup, which is usually not found further south in the Bantu expansion route (for example, Angola18) but is detected at higher frequencies above the starting point of these migrations (for example, Guinea-Bissau31). Therefore, the presence of this lineage could be an indication of a genetic pool that existed before the beginning of the Bantu expansion and was not spread by this movement. Other non-Bantu lineages found within this haplogroup are E1b1b1a-M78 and E1b1b1b-M81, which have been frequently found in North Africa and also in Europe.32

Both haplogroups G and N (accounting for two and one chromosomes, respectively) are rare in Africa, and their low frequency in this sample can most likely be explained by a recent Eurasian influx. Altogether, the proportion of recent Eurasian admixture found in our sample is approximately 15% (haplogroups E1b1b1b-M81, G-M201, N1c-Tat, R1b1b2-M269 and two chromosomes belonging to E1b1b1a-M78—see criteria below), which is easily explained by the well-reported European arrivals to this territory within the last five centuries.

For the Y-SNP data, the haplogroup diversity of our sample (0.7526±0.0261), although higher than that found in other West African populations,18 is still lower than for other regions, a fact that has been explained by a loss of diversity during the Bantu migrations.3, 5, 18 For the Y-STRs, high levels of haplotype diversity (0.9987±0.0014) and of the mean number of pairwise differences (9.5048±4.3932) were found.
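The mean number of pairwise differences can be illustrated with a minimal Python sketch that, for every pair of chromosomes, counts the loci at which the two haplotypes differ and then averages over all pairs. Counting differing loci (rather than differences in repeat number) is an assumption of this example, and the haplotypes shown are invented.

```python
from itertools import combinations


def mean_pairwise_differences(haplotypes):
    """Average number of loci that differ between all pairs of haplotypes."""
    pairs = list(combinations(haplotypes, 2))
    if not pairs:
        raise ValueError("at least two haplotypes are needed")
    total = sum(
        sum(a != b for a, b in zip(first, second)) for first, second in pairs
    )
    return total / len(pairs)


# Invented three-locus haplotypes, for illustration only.
toy_haplotypes = [(14, 13, 29), (14, 13, 30), (15, 12, 30)]
print(mean_pairwise_differences(toy_haplotypes))
```

Each chromosome is passed as its own entry, so haplotypes shared by several individuals contribute pairs with zero differences.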

The results obtained from Equatorial Guinea were compared with available data for other African populations5, 33, 34, 35 (see Supplementary Figure S2 and Table 1). However, Africa is still a poorly studied continent, and data for comparative analysis were not always accessible. Furthermore, the data for all populations need to be reduced to the same resolution level, requiring a minimal common set of Y-STRs and further constraining the amount of data available for comparisons. To overcome potential bias due to recent historical factors (for example, colonisation by Europeans), the levels of genetic diversity were calculated in different African samples and within each lineage (with n>5) after excluding chromosomes that had most likely been recently introduced by Europeans (including haplogroups E1b1b1, F, G, N1c, R1a, R1b1b2 and T). Because chromosomes belonging to haplogroup E1b1b1a-M78 can be considered of either African or European ancestry, a search for identical haplotypes was performed on the YHRD,36 and only those that did not present a match in European populations were maintained in the analysis.
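The exclusion rule described above can be expressed as a small filter. The sketch below is one possible reading of it: haplogroup labels are assumed to follow the "name-marker" pattern used in this article, exclusion is applied by name prefix, and the YHRD search result is represented by a hypothetical boolean flag supplied by the user.

```python
# Haplogroups treated in the text as likely recent European introductions.
EXCLUDED_PREFIXES = ("E1b1b1", "F", "G", "N1c", "R1a", "R1b1b2", "T")


def keep_for_diversity(haplogroup, has_european_yhrd_match=False):
    """Decide whether a chromosome stays in the diversity analysis.

    `haplogroup` is a label such as "R1b1b2-M269" (assumed format).
    `has_european_yhrd_match` is a hypothetical flag recording whether the
    Y-STR haplotype matched a European sample in the YHRD.
    """
    name = haplogroup.split("-")[0]
    if name.startswith("E1b1b1a"):
        # E1b1b1a-M78 may be African or European: keep only unmatched haplotypes.
        return not has_european_yhrd_match
    return not name.startswith(EXCLUDED_PREFIXES)


sample = ["E1b1a-M2", "R1b1b2-M269", "R1b1a-V88", "G-M201", "E1b1b1b-M81"]
print([hg for hg in sample if keep_for_diversity(hg)])
```

Prefix matching is a convenience of this sketch; a production filter would more safely test membership against the phylogeny itself.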

Table 1 Diversity indices estimated with 10 Y-STRs and Y-SNP haplogroup diversities, excluding those chromosomes that most likely have been recently introduced by Europeans (see text for excluded haplogroups)

The diversity values observed for Y-SNPs are lower for Bantu populations (Table 1), which may be an indication of the homogeneity that characterises sub-Saharan African populations after the replacement of the previously existent lineages as a consequence of the Bantu expansion.3, 5, 18 In contrast, the levels of Y-STR diversity observed are high and generally similar across populations (Table 1), with the highest values observed mainly in Bantu samples. Moreover, the mean number of pairwise differences (MNPD) values are more heterogeneous, and the Bantu samples generally presented the lowest values. Additionally, a population further south along the route of the Bantu migrations (that is, Cabinda) presents a lower MNPD value than those near the migration origin (that is, Equatorial Guinea and the Bantu sample from Cameroon), which could be an indication of genetic pool drift along the Bantu expansion path, responsible for the loss of variation in populations on the edge of the route. Nevertheless, an analysis of the MNPD within each lineage shows that no significant differences are found within each population, indicating that no haplogroup suffered a strong founding effect in those populations. This result was also confirmed through the analysis of WIMP (weighted mean intralineage mean pairwise difference),37 which shows a pattern similar to that of MNPD.

Population comparisons were performed using a classical genetic distance method (FST). A stepwise-based model (RST), which accounts for the number of differences observed at each locus assuming the single stepwise model for the formation of new alleles, has been developed as a statistic for microsatellite data.38 However, the weight given to mutations by this method does not reflect the demographic events involved in Y-STR evolution in sub-Saharan populations because it underestimates the importance of genetic drift.

For most of the Y-STR and Y-SNP comparisons performed, the pairwise differences were statistically significant (P<0.05). For our sample, the only non-significant values were obtained with Cabinda (for both marker types), the Bantu sample from Cameroon (for Y-STRs), and Guinea-Bissau and the Bantu sample from Gabon (for Y-SNPs) (Supplementary Tables S3 and S4).

An MDS analysis was performed to allow an easier interpretation of the pairwise distance values matrix (Supplementary Figures S3 and S4). In both MDS plots obtained, a cluster composed of the Bantu samples was observed, including our sample from Equatorial Guinea, whereas the non-Bantu populations are spread across the plot. This is another indication of the importance of ethnographic information when studying sub-Saharan populations because high distance values are obtained for different samples (labelled according to their linguistic affiliation) within a country (for example, Pygmy and Bantu samples from Gabon).

Haplogroup R and the ‘back to Africa’ hypothesis

Haplogroup R is the most common haplogroup in European populations, and although it is usually rare in Africa, chromosomes bearing the P25-derived allele (lineage R1b1) have been reported at frequencies as high as 95% in some Central African populations.5, 6, 7, 29, 30 This haplogroup is thought to have originated in Central Asia approximately 40 000 years BP and then migrated westward into Europe, achieving its highest frequencies in the western region of this continent.39, 40 Although Balaresque et al41 proposed the hypothesis of a European spread of haplogroup R1b1b2-M269 during the Neolithic, the distribution of the M269 sub-haplogroups and their Y-STR diversities proved to be compatible with a pre-Neolithic diffusion of M269 in Europe.42, 43 The reported high frequencies of this haplogroup in Central-West Africa led to the proposal of a ‘back to Africa’ migration as the justification for the otherwise unexpected presence of this haplogroup in the region.

In our sample, R1b1 was the second most frequently observed haplogroup, present in 17% of the sample. Ten out of the nineteen chromosomes that belong to this haplogroup present the M269-derived allele, a typical European marker, and may therefore indicate recent European influx. Although European arrivals within the last five centuries have been well reported in this country, European influx was not significant in the neighbouring regions, such as Cameroon and Gabon, where this Eurasian haplogroup is rarely observed.5, 7 The remaining chromosomes presented the V88-derived allele, which was recently reported to be present in all of the typed R1b1*-P25 African chromosomes.7 Furthermore, in our sample, most of the V88 chromosomes presented non-consensus alleles at the DYS385 marker, suggesting a different additional sub-lineage within this haplogroup.

A phylogenetic network of R1b1 lineages based on 10 Y-STR haplotypes was constructed with samples from Cameroon and Gabon5 and from the present study (Figure 2a). A clear separation of the R1b1b2-M269 samples was observed; they clustered with the European modal haplotype, supporting their European ancestry. The non-consensus alleles observed in both studies do not show a well-defined separation from the samples with consensus alleles. However, a cluster containing haplotypes with both consensus and non-consensus alleles (representing two different lineages) and another exclusively with consensus alleles could be indicative of at least three different lineages within the R1b1-P25( × M269) haplogroup. Nonetheless, it is important to note that the data from Berniell-Lee et al5 were published before the recently discovered mutations within P25.7 Cruciani et al7 did not find any intermediate variant alleles at the locus DYS385, which were all found within haplogroup R1b1a-V88 in the present work (with only two chromosomes presenting consensus alleles). This result may indicate the presence of different sub-lineages within this haplogroup that are yet to be discovered.

Figure 2

Phylogenetic network constructed with information from (a) 10 Y-STR haplotypes (DYS19, DYS389I, DYS389II, DYS390, DYS391, DYS392, DYS393, DYS437, DYS438 and DYS439) within the R1b1-P25 haplogroup; and (b) 7 Y-STR haplotypes (DYS19, DYS391, DYS393, DYS439, DYS460, DYS461 and Y GATA A10) for the R1b1a-V88 haplogroup. Microsatellite haplotypes are represented by circles, with size proportional to their frequency in the samples and colours corresponding to their geographic regions.

Phylogenetic comparisons regarding the lineages recently found by Cruciani et al7 were also performed based on information from seven Y-STRs (four typed in the present work and three additional from a previous study of the same Equatorial Guinea samples by Arroyo-Pardo et al16), with R1b1a-V88 samples from North and Central Africa, Europe and from the present work (Figure 2b). The network results show that the Equatorial Guinea samples appear to be related to the remaining African samples. Furthermore, the R1b1a samples from the present work seem to characterise an old lineage due to the highly dispersed pattern presented and do not show the typical signs of a recent origin or founder effects.

The variance of these microsatellite data in the African continent (present study and Cruciani et al7) was also analysed for the V88 marker (Table 2) because the diversity of each lineage reflects its age.41 As observed in Table 2, a higher value of average variance can be observed in the sample from Equatorial Guinea, indicating an ancient origin for these lineages. Furthermore, the highest variance values were found in the Equatorial Guinea population for four out of the seven markers analysed. Nonetheless, Levene’s test only revealed significant heterogeneity between the variance values for DYS439, DYS461 and Y GATA A10 (Table 2). To specifically assess the significance of the values observed in Equatorial Guinea, a homogeneity test was also performed between our sample and data from Central and North Africa (Table 2). In both cases, significant differences were only observed in two of the seven comparisons, which demonstrates that the small size of the samples hinders the extrapolation of our conclusions from the samples to the populations. Moreover, it would be valuable to calculate these levels in less broad areas, but this is not possible when relying on the published data. It is also worth noting that a recent discussion highlighted the influence of the microsatellite mutation rate on age estimates,42 and thus, the set of markers used must be taken into consideration when estimating the age of the lineages.
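To make the homogeneity test concrete, the following Python sketch computes the per-group variance of repeat counts at one locus and the mean-based Levene statistic W for two or more groups. It is a plain re-implementation for illustration, not the SPSS procedure used in this study; it returns only the statistic (no P-value), and the allele values are invented.

```python
from statistics import mean, variance


def levene_w(*groups):
    """Mean-based Levene statistic W for homogeneity of variances."""
    k = len(groups)
    n_total = sum(len(group) for group in groups)
    # Absolute deviation of every observation from its own group mean.
    deviations = [[abs(y - mean(group)) for y in group] for group in groups]
    group_means = [mean(z) for z in deviations]
    grand_mean = sum(sum(z) for z in deviations) / n_total
    between = sum(
        len(z) * (z_mean - grand_mean) ** 2
        for z, z_mean in zip(deviations, group_means)
    )
    within = sum(
        (value - z_mean) ** 2
        for z, z_mean in zip(deviations, group_means)
        for value in z
    )
    # Note: `within` is zero (division error) if every deviation equals its group mean.
    return (n_total - k) / (k - 1) * between / within


# Invented repeat counts at a single locus in two groups, for illustration only.
group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6, 8, 10]
print(variance(group_a), variance(group_b))
print(levene_w(group_a, group_b))
```

W is compared against an F distribution with k−1 and N−k degrees of freedom; with small groups, as in the comparisons discussed above, the test has little power to detect differences in variance.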

Table 2 Variance values observed for V88 mutation for each STR locus in Equatorial Guinea and in Central and North Africa7, and Levene’s test values for the three populations together and comparing Equatorial Guinea with each of the other populations

The origin of the V88 lineages

Although a recently advanced hypothesis proposes that the V88 lineages migrated with Proto-Chadic speakers from North Africa through the Central Sahara into the Lake Chad Basin,7 the high variance found in lineages from haplogroup R1b1a in the sample from Equatorial Guinea means that our results are also compatible with an origin of the V88 lineages in Central-West Africa.

Assuming that Central-West Africa is in fact the place of origin for V88, the arrival of Chadic groups in the Lake Chad Basin from the North is as likely as the alternative hypothesis of a migration mediated by Proto-Chadic-speaking people coming from East to West Africa (the ‘Inter-Saharan’ hypothesis), which was previously defended by Lancaster.11

According to Blench’s ‘inter-Saharan’ hypothesis, Chadic speakers originated during the westward migration of a pastoralist Cushitic group, from the Nile towards Lake Chad, with subsequent dispersion in different directions around the lake. The pastoralist nature of the groups involved in the dispersion of haplogroup R1b1a in Central Africa is also supported by the V88 distribution as well as by the distribution of its subclade V69 (the only one found in the African continent).7 In fact, although the origin and dispersion of these lineages seem to be much older than the beginning of the Bantu expansion, which began in this same region in Central Africa,44, 45 its frequency decreases drastically towards the South.

Furthermore, in opposition to the previously suggested direction of migration,7 the presence of these lineages in North Africa could also be explained as the result of a migration of V88 carriers from South to North, possibly during the mid-Holocene.

In summary, our data are compatible with an origin of the V88-derived allele in Central-West Africa and a later migration (or migrations) across the Sahara to North Africa. Assuming this origin for V88, both the ‘Trans-Saharan’ and ‘Inter-Saharan’ hypotheses for the arrival of Chadic groups in the Lake Chad Basin are equally likely, as already described in other studies,8, 46 but this model is also compatible with the dispersion of V88 from Central-West to North Africa, contrary to what was proposed by Cruciani et al.7 Nevertheless, further evidence will be required to support this hypothesis, and additional data, namely from the populations in the proposed paths of migration, are crucial to obtain a deeper understanding of the origin and history of this haplogroup.