Introduction

Polymorphisms such as the bi-allelic mutations of the male-specific portion of the Y chromosome (MSY) are important tools that have proved essential in addressing aspects of human ancestry and migration episodes1 and in testing coalescence processes.2 Interestingly, some bi-allelic markers of the Y chromosome not only have geographically defined distributions, but are also associated with certain facets of human culture, such as languages3, 4 and the practice of pastoralism,5 all of which contribute to genetic drift, probably the single most important element in shaping population genetic structure.6 The high correlation between the geographical distribution of some of the major E haplogroups and the distribution of Afro-Asiatic languages, an example of the correlation between languages and genes proposed by Cavalli-Sforza,7, 8 prompted us to revisit this correlation on a multidisciplinary platform better suited to unravelling hitherto untold chapters of human history. There is no better venue for putting such an approach into practice than the Sahel and East Africa. The Sahel, which extends from the Atlantic to the Red Sea coast of Sudan and Eritrea and the Ethiopian highlands, including the fringes of the Sahara, has witnessed demographic events that were pivotal in prehistoric and historic periods of human history. Early occupation of the Red Sea coast of Eritrea by Homo sapiens9, 10, 11, 12, 13 and traces of early urban settlements in much of Eritrea14, 15, 16 are among the archaeological and paleontological evidence suggesting a major contribution of this area to prehistory and migration, including the exodus of anatomically modern humans to Eurasia.
Furthermore, in addition to being strikingly rich in genetic and linguistic diversity, the area is one of the few remaining enclaves of traditional pastoralism, a dying human culture.17 Although it has been suggested that East Africa is the likely place of origin of Y-chromosome haplogroups, including the major E haplogroups, key questions on human origin and dispersal have not been fully addressed. One such question is whether macrohaplogroup E, which is present on almost all continents and reaches particularly high frequencies in East and North Africa with a plethora of ancestral lineages, owes its distribution to gene flow or to an early event of in situ evolution. Although much has been done to refine the E macrohaplogroup tree, sampling representative populations, such as Eritreans, may still shed light on new dimensions of the history of the populations bearing these mutations. Apart from a single study of Eritrean populations from the diaspora,18 no systematic analysis has so far addressed the genetic diversity of extant Eritrean populations in relation to questions such as the origin of the Afro-Asiatic languages and pastoralism, using the distribution of macrohaplogroup E as a case study.

Materials and methods

Y-chromosome genotyping of bi-allelic markers

A total of 1214 Y chromosomes, positive for E haplogroups, were considered in the analysis. Out of an original sample of Eritrean males screened, 39 Y chromosomes (49%) turned out to be positive for E markers and were included in the analysis. The language affiliation and the present or past history of the populations analyzed are given in Supplementary Table S1. Culture is considered here within the context of current linguistic affiliation and information on past and present subsistence practices. The history of pastoralism is not restricted to cattle, as livestock may change according to the environment, as is the case with the Baggara Arabs, who were originally camel herders and later turned to cattle. Appropriate informed consent was obtained from all participants. DNA samples were obtained from buccal specimens using phosphate-buffered saline, and DNA extraction was carried out according to Miller et al,19 with minor modifications. The bi-allelic variability at the Y-chromosome-specific polymorphisms E-M107, E-M123, E-M148, E-M191/P86, E-M200, E-M281, E-M329, E-M33, E-M34, E-M54, E-M81, E-P72, E-U175, E-V19, E-V32, E-V6 (Y Chromosome Consortium (YCC), 2008)20 and E-M78 (ref. 21) was used to generate MSY chromosome haplotypes. Primers for genotyping were selected according to Karafet et al20 and Underhill et al22 and the references therein. Most of the genotyping was done at BGI Laboratory (Hong Kong, China). Published data on African, Asian and European populations22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34 were used alongside the population data from this study for comparative analysis.

Y-chromosome haplogroup tree

The Y-chromosome haplogroup tree was constructed manually following the YCC 2008 nomenclature20 with some modifications.35 The tree (Supplementary Figure S1) contains the E haplogroups of Eritrean populations from this study and those reported in the literature.22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34 Genotyping results for E-V13, E-V12, E-V22 and E-V32 reported for Eritrean samples and elsewhere23, 27 were collapsed to the E-M78 haplogroup level. All the analyses in this study were done at the same resolution using the following 17 bi-allelic markers: E-M96, E-M33, E-P2, E-M2, E-M58, E-M191, E-M154, E-M329, E-M215, E-M35, E-M78, E-M81, E-M123, E-M34, E-V6, E-V16/E-M281 and E-M75.
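Bringing data sets typed at different depths to a common resolution can be sketched as follows. This is a minimal illustration in Python and not the procedure actually used in the study: the function names are hypothetical, the example counts in the usage note are invented, and only the four E-M78 sub-lineages named above are mapped.

```python
from collections import Counter

# Sub-lineages named in the text that are reported at the E-M78 level.
# Any haplogroup not listed here is kept unchanged.
COLLAPSE_TO = {
    "E-V12": "E-M78",
    "E-V13": "E-M78",
    "E-V22": "E-M78",
    "E-V32": "E-M78",
}


def collapse_haplogroups(counts, mapping=COLLAPSE_TO):
    """Sum chromosome counts after renaming sub-lineages to their parent."""
    collapsed = Counter()
    for haplogroup, n in counts.items():
        collapsed[mapping.get(haplogroup, haplogroup)] += n
    return dict(collapsed)


def to_frequencies(counts):
    """Convert chromosome counts to relative haplogroup frequencies."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no chromosomes to count")
    return {haplogroup: n / total for haplogroup, n in counts.items()}
```

For instance, calling `collapse_haplogroups({"E-V32": 3, "E-V12": 1, "E-M78": 2, "E-M2": 4})` on these invented counts pools the first three entries under E-M78, so that populations typed at different depths can be compared at one level.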

Phylogenetic analysis

A median-joining network (Figure 1) was constructed using the Network 4.6.1.1 program (http://www.fluxus-engineering.com).36 An effective mutation rate of 3.8–4.4 × 10⁻⁴ mutations/nucleotide/generation37 was used to estimate the divergence time of the populations in Network.

Figure 1

Median-joining (MJ) network. The network layout was arranged to fit the geography of the extant populations. The MJ network was constructed using E haplogroup frequencies. The group labelled ITAL contains all the Italian samples pooled. Populations’ descriptions are given in Supplementary Table S1.

A neighbor-joining (NJ) tree (Figure 2) was constructed38 and evolutionary analysis was conducted using MEGA5 (Tamura et al39).

Figure 2

NJ tree based on FST values generated from Arlequin 3.11. Population names are as given in Supplementary Table S1. Population lifestyle: circle – agriculturalists; square – pastoralists; triangle – nomads; inverted triangle – nomadic pastoralists; diamond – agro-pastoralists. The populations are colored according to their language family: red – Afro-Asiatic; blue – Nilo-Saharan; green – Niger-Kordofanian; yellow – Khoisan; black – Italic and Basque.

Genetic structure and population differentiations

Multi-dimensional scaling (MDS) and principal component analysis (PCA) were performed using PAST (PAleontological STatistics) version 2.11 (available online at http://folk.uio.no/ohammer/past)40 based on FST values generated by Arlequin version 3.11.41 Analysis of molecular variance (AMOVA) was performed to test for statistical differences between linguistic and geographic groups. Haplotype frequencies and molecular differences among Y-chromosome haplogroups were taken into account. FST values were calculated from the number of pairwise differences between Y-chromosome haplogroups. All calculations were performed in Arlequin version 3.11 (Excoffier et al41) using the data for the 17 bi-allelic markers listed above. The correlation among genetic, linguistic and geographic distances was assessed by the Mantel test using the FST matrices resulting from the Arlequin analysis.
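The logic of a Mantel test can be illustrated with a short permutation sketch. The text does not describe how the test was run, so this is an assumption-laden illustration rather than the study's procedure: the function name `mantel_test`, the number of permutations and the one-sided p-value are all choices made for this example, and it uses only NumPy.

```python
import numpy as np


def mantel_test(dist_a, dist_b, permutations=999, seed=0):
    """Correlate two distance matrices and assess significance by permutation.

    Returns the Pearson correlation between the upper triangles of the two
    matrices and a one-sided p-value (probability of a correlation at least
    as large when the population labels of `dist_a` are shuffled).
    """
    a = np.asarray(dist_a, dtype=float)
    b = np.asarray(dist_b, dtype=float)
    if a.shape != b.shape or a.ndim != 2 or a.shape[0] != a.shape[1]:
        raise ValueError("inputs must be square matrices of the same size")

    n = a.shape[0]
    upper = np.triu_indices(n, k=1)  # each population pair once, no diagonal
    b_vec = b[upper]
    r_obs = np.corrcoef(a[upper], b_vec)[0, 1]

    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(permutations):
        order = rng.permutation(n)
        # Shuffle rows and columns together so the matrix stays a valid
        # distance matrix for relabelled populations.
        a_perm = a[np.ix_(order, order)]
        if np.corrcoef(a_perm[upper], b_vec)[0, 1] >= r_obs:
            hits += 1

    # The observed arrangement counts as one of the permutations.
    p_value = (hits + 1) / (permutations + 1)
    return r_obs, p_value
```

In this setting `dist_a` would be a pairwise FST matrix and `dist_b` a matrix of geographic or linguistic distances between the same populations in the same order. Shuffling whole populations, rather than individual matrix cells, is what keeps the test valid despite the non-independence of pairwise distances.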

Results

Y-chromosome haplogroup diversity

Bi-allelic frequencies of E haplogroups for the populations involved in the analysis are given in Supplementary Figure S1 following the recent nomenclature.20

Phylogenetic analysis

The network analysis of the chromosomes carrying E haplogroups was robust, with a main cluster near the root represented by the Kunama (KUN) and encompassing most Eritrean and Sudanese populations, including both Nilo-Saharan and Afro-Asiatic speakers. This suggests that linguistic divergence followed population divergence, that language replacement occurred, or that the two linguistic families share a common origin. The Southern African populations, which include the Khoisan and the Bantu of South Africa, diverge from the larger East African cluster through their connection to the Somali population. The network also suggests that dispersal of the haplogroup to Southern Africa may reflect the spread of pastoralism from North East Africa.5 The Yemeni, Saudi Arabian and Omani populations, on the other hand, form a Near Eastern group. The link between the Yemeni and Omani populations and the Afar and Saho populations of Eritrea could be attributed to geographical proximity and possibly to past genetic history. The Northern African populations tend to separate into two distinct groups: one containing Moroccan Arabs, Berbers and Saharawi, derived from the larger East African group, and the other comprising the Northern African populations of Algeria, Egypt and Tunisia, which connects to both Europeans and Eritreans and Ethiopians, hinting at a recent genetic relationship between North and East African populations, as is widely believed.30

The NJ tree, which was not rooted, was also robust, showing groupings similar to those of the network, MDS and PCA plots, which implies a correlation with language and a relevance to geography. With few exceptions, all populations carrying the haplogroup were either pastoralists or had a recorded history of pastoralism. The exceptions include the Hausa, Fur and Masalit, who have strong agricultural practices, although the Masalit are thought to have a recent history of mixed farming or foraging. The other exceptions are the Copts from Egypt and the Tigrigna from Eritrea, both with a documented history of agricultural practices, albeit historically part of larger communities with established pastoralist practices. The Nilo-Saharan and Niger-Kordofanian speakers were confined to the cluster from Sudan and Eritrea.

Genetic structure and population differentiation

The MDS and PCA plots (Figure 3 and Supplementary Figure S2, respectively) generated from the E haplogroup frequency data portrayed similar patterns that complement the network result. In general, four main clusters can be identified from the MDS and PCA plots. In the MDS plot, one of the main clusters (grey shaded) comprises almost all Eastern Africans, including most Eritrean and Sudanese populations. The Saho and Afar populations of Eritrea tend to cluster with the Near Eastern or Arabian populations (brown shaded). The West and Southern African populations (blue shaded) form the third cluster, while the North African populations form the fourth (green shaded). Interestingly, populations from Egypt, Tunisia and Ethiopia (Ethiopian Jews) occupy an intermediate position between the East African and Near Eastern clusters. The PCA (Supplementary Figure S2) gave the same result, clustering the majority of East Africans (grey shaded) in the first component and separating North Africans (brown shaded) from Middle Eastern populations (blue shaded) in the second component. The first two components account for 83% of the variation observed.

Figure 3

MDS plot based on the FST values generated from Arlequin 3.11, using the Rho similarity measure, with a stress value of 0.07101. Populations’ descriptions are given in Supplementary Table S1.
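How a two-dimensional configuration and a stress value arise from a distance matrix can be sketched as follows. This example uses classical (metric) MDS by eigendecomposition, which is a deliberate simplification and may differ from the routine in PAST, so it should not be expected to reproduce the configuration or the stress value of Figure 3. The function names are hypothetical, and the stress formula is one common stress-1 style normalisation.

```python
import numpy as np


def classical_mds(dist, n_components=2):
    """Embed a symmetric distance matrix in `n_components` dimensions."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-centre the squared distances to obtain a Gram matrix.
    gram = -0.5 * centering @ (d ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)  # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:n_components]
    # Negative eigenvalues (non-Euclidean input) are truncated to zero.
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale


def stress(dist, coords):
    """Misfit between input distances and distances in the embedding."""
    d = np.asarray(dist, dtype=float)
    xy = np.asarray(coords, dtype=float)
    diff = xy[:, None, :] - xy[None, :, :]
    fitted = np.sqrt((diff ** 2).sum(axis=-1))
    upper = np.triu_indices(d.shape[0], k=1)
    residual = ((d[upper] - fitted[upper]) ** 2).sum()
    return np.sqrt(residual / (fitted[upper] ** 2).sum())
```

A stress close to zero means the plotted distances reproduce the input distances almost exactly; larger values indicate that two dimensions distort the relationships among populations.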

As indicated in the AMOVA summary (Table 1), when Eritrean populations were grouped according to geographic location, most of the genetic variance (82.11%) was found within populations, a value similar to that obtained (82.44%) when the populations were grouped according to linguistic affiliation. Variance among populations within the linguistic groups was 14.71%, slightly higher than the variance among populations within the geographic groups (13.17%). The genetic variance among the linguistic and geographic groups was 2.85% and 4.71%, respectively. AMOVA was also carried out to examine the variation of the populations from this study and from published work in relation to their linguistic and geographic affiliation. For this purpose, populations were grouped as Afro-Asiatic, Nilo-Saharan and Niger-Kordofanian, with Middle Eastern and European populations excluded in both cases. Most of the genetic variance (52.66%) was found within populations. The genetic variance among groups and among populations within groups was 18.73% and 28.66%, respectively. Grouping the populations into North, South, West and East Africans gave a different result from grouping them by linguistic affiliation: 25.89% of the variance was among groups, 19.63% among populations within groups and 54.48% within populations. The Mantel test showed no correlation between the genetic distances of the populations and either their geographical isolation or their linguistic affiliation.
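The variance percentages quoted above can be tabulated and checked with a few lines of Python. This is only a reader's sketch: the dictionary labels and the function name are hypothetical, the 13.17% figure is assigned to the "among populations within groups" level because the three geographic values then sum to about 100%, and the fixation indices computed by the function follow the standard AMOVA definitions and are not values reported in the text.

```python
# Percentages of variance as quoted in the text, in the order:
# (among groups, among populations within groups, within populations).
AMOVA_PERCENT = {
    "Eritrean populations, linguistic grouping": (2.85, 14.71, 82.44),
    "Eritrean populations, geographic grouping": (4.71, 13.17, 82.11),
    "African populations, linguistic grouping": (18.73, 28.66, 52.66),
    "African populations, geographic grouping": (25.89, 19.63, 54.48),
}


def fixation_indices(among_groups, among_pops_within, within_pops):
    """Derive Phi_CT, Phi_SC and Phi_ST from three variance components.

    The inputs may be raw variance components or percentages; only their
    ratios matter.
    """
    total = among_groups + among_pops_within + within_pops
    phi_ct = among_groups / total
    phi_sc = among_pops_within / (among_pops_within + within_pops)
    phi_st = (among_groups + among_pops_within) / total
    return phi_ct, phi_sc, phi_st


if __name__ == "__main__":
    for label, parts in AMOVA_PERCENT.items():
        phi_ct, phi_sc, phi_st = fixation_indices(*parts)
        print(f"{label}: total={sum(parts):.2f}% "
              f"Phi_CT={phi_ct:.3f} Phi_SC={phi_sc:.3f} Phi_ST={phi_st:.3f}")
```

Laying the figures out this way makes the contrast explicit: among Eritrean populations the among-group share is small under either grouping, whereas across African populations both the linguistic and the geographic grouping capture a much larger share of the variance.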

Table 1 Summary of AMOVA analysis

Discussion

The Sahel, which extends between the Atlantic coast of Africa and the Red Sea plateau, represents one of the least sampled areas and populations in the domain of human genetics. The position of Eritrea adjacent to the Red Sea coast provides opportunities for insights regarding human migrations within and beyond the African landscape.

Worth noting in the current data set is the absence of differentiation of Eritrean populations along geographical and linguistic lines, which may reflect admixture42 or a common founding population with subsequent drift. The sharing of derived alleles for E and other deeper Y-chromosome lineages (unpublished data) between Eritreans and other populations from the region makes this part of East Africa a likely scene for some of the earliest demographic episodes within the continent as well as the subsequent expansion out of it, a scenario that seems to corroborate paleontological, archeological and genetic evidence.9, 43, 44, 45

The network cluster associated with the Eritrean Nilo-Saharan Kunama (Figure 1) may represent an expansion event following the out-of-Africa migration,31, 46 possibly close to the origin of the ancestral Y-chromosome clades.47, 48, 49 The expansion, carrying the diversified E-P2 mutation, may be responsible for the migration of male populations to different parts of the continent and hence for the rise and spread of the bearers of the macrohaplogroup.50 These types of population movements, or demic expansions, driven by climatic change and/or the spread of pastoralism and to some extent agriculture,51, 52, 53, 54 are not uncommon in human history. This scenario is further substantiated by the refinement of E-P2 (Trombetta et al35) and its two basal clades, E-M2 and E-M329, which are believed to be prevalent exclusively in Western Africa and Eastern Africa, respectively.

Interestingly, this ancestral cluster includes populations such as the Fulani, who have previously been shown to display Eastern African ancestry and a common history with the Hausa, who are the westernmost Afro-Asiatic speakers in the Sahel, with a large effective size and a complex genetic background.23 The Fulani, who currently speak a language classified as Niger-Kordofanian, may have lost their original tongue to an associated sedentary group, as other cattle herders in Africa have done, a common tendency among pastoralists. Clearly, the cultural trends exemplified by populations such as the Hausa or the Massalit, the latter having a strong tradition in neither agriculture nor animal husbandry, were established after the initial differentiation of haplogroup E. For example, the early clusters within the network also include Nilo-Saharan speakers, such as the Kunama of Eritrea and the Nilotic populations of Sudan, who are ardent nomadic pastoralists but speak languages outside Afro-Asiatic, the predominant linguistic family within the macrohaplogroup.

The subclades of the network, some of which are associated with the practice of pastoralism, most likely arose in the Sahara, among an early population that spoke an ancestral language common to both Nilo-Saharan and Afro-Asiatic speakers. It is yet to be determined, however, whether pastoralism was original to the culture of Nilo-Saharan speakers or a cultural acquisition, or vice versa; this is an interesting notion to entertain in light of the proposition that pastoralism may be quite ancient in human history.17 Pushing the dates of the events associated with the origin and spread of pastoralism back to 12 000–22 000 YBP, as suggested by the network dating, would resolve the matter, as the language differences would not yet have appeared, and an original pastoralist ancestral group with a common culture and language50 becomes a plausible scenario. Such dates would accommodate both the Semitic/pastoralism-associated expansion and the introduction of Bos taurus to Europe from North East Africa or the Middle East.55 The network places North African populations such as the Saharawi and the Moroccan Berbers and Arabs in a separate cluster. Given the proposed origin of Maghreb ancestors56, 57, 58, 59 in North Africa, our network dating suggests a divergence of North Western African populations from Eastern African populations as early as 32 000 YBP, which is close to the estimated dates for the origin of the E-P2 macrohaplogroup.30, 60 It can further be inferred that the high frequency of E-M81 in North Africa and its association with Berber-speaking populations25, 30, 32, 60, 61 may have arisen after the splitting of that early group, leading to local differentiation and the flow of some markers as far as Southern Europe.30, 60, 62

A branching in the network may again represent an episode of human migration that carried haplogroup E-M35 and its subhaplogroups farther across the Red Sea to Yemen, Oman and Saudi Arabia, and concurrently down to Southern Africa, as part of a more recent out-of-Africa migration episode traceable through the Y chromosome.

The PCA and MDS plots display a similar and interesting grouping of the Afar and Saho populations of Eritrea with the Near Eastern Arabian populations, pointing to a genetic relationship between the two sides of the Red Sea. The arrival of E-M35 and its derived subclades, for example E-M123/E-M34, in Arabia appears to be strongly linked to the expansion into East Africa, North Africa, Europe and Southern Africa, an event that is likely related to pastoralism and hastened by its advent, and that is amenable to analysis and dating using approaches similar to those proposed for the co-migration of Y chromosomes and disease traits.63

The presence of archeological10, 11, 12, 13 and agro-pastoral9, 14, 16 evidence from this side of the Red Sea and the history of migration of animals across the Red Sea,64 however, call for more molecular dissection of the common haplogroups shared by these coastal populations. As suggested by others, this may give clues not only to the origin of E-M123, J-M267 and K-M70, but also to the origin of Semitic languages.65, 66 Indeed, the trail of such historical movements is detectable through the molecular signatures of markers such as those on the Y chromosome, giving insights into episodes of an even more regional nature. For example, the high frequency of E-V32 in Eritrea, in concordance with oral history, supports the historical ties between North East Africa (Egypt) and East Africa, including Eritrea, Sudan, Ethiopia and Somalia.

Conclusion and future work

Although most of the data sets in our study define the deep ancestry of the phylogeny, they still inform our interpretation of recent phenomena, such as the current genetic diversity of the E haplogroup, with implications for the origin and spread of Afro-Asiatic languages and the history of pastoralism.67 Such perspectives, however, should be tested using more recently derived markers5, 47 within the major haplogroups to explain the archeological findings and the historical and current demography of the region. Moreover, further comparative genetic analysis of the two sides of the Red Sea, with special emphasis on the E-M123/E-M34 or E-M78 haplogroups, will refine not only the route of exit of H. sapiens sapiens from East Africa but also the genealogies of Afro-Asiatic languages in the region.