Tracing the origin and expansion of pastoral nomadism in the Middle East has widespread significance for understanding the development of the civilizations of the ancient Near East and the spread of the Semitic languages throughout the Levant, the Arabian Peninsula and Mesopotamia. Y-chromosome analyses of modern populations of the Middle East can contribute to the delineation of the demographic and migration processes in this region. The predominant categories of Y chromosomes in this region are varieties associated with haplogroup J-M304. This haplogroup essentially bifurcates into two main sub-clades, J1-M267 and J2-M172.1

Previous studies2, 3, 4, 5, 6, 7 of J1-M267 have found it to occur at high frequencies among the Arabic-speaking populations of the Middle East, conventionally interpreted as reflecting the spread of Islam in the first millennium CE.8 However, before the middle first millennium CE, a variety of Semitic languages were spoken throughout the Middle East. Recently, historical linguists9 have constructed novel classification trees of the Semitic languages in which the first split from the Proto-Semitic root separates East Semitic (Akkadian, Assyrian, Babylonian and Eblaite) from West Semitic. West Semitic then partitions into Ethiopic, Modern South Arabian (spoken in areas of Oman and Yemen) and the core cluster of Central Semitic. Central Semitic in turn includes the languages of Yemen (Old South Arabian), Arabic and the Northwest Semitic languages of the Levant: Ugaritic, Hebrew, Phoenician and Aramaic.9, 10, 11, 12 Linguists have not only reconstructed the phylogeny of the Semitic languages but have also dated Proto-Semitic to the Chalcolithic Era, circa 5500–3500 BCE.13 In addition to the common Semitic language substrate found throughout the Levant and Arabian Peninsula, recent archeological studies have shown an early presence (ca. 6000–7000 BCE) of domesticated herding in the arid steppe desert regions.14

We recently showed an inverse correlation between J1-M267 frequency and mean annual rainfall in Middle Eastern populations.15 This finding was interpreted as a founder effect associated with small groups of Neolithic herder–hunters moving into the arid regions of the Arabian Peninsula with a pastoral economy, whereas another ancestral population with a closely associated sister clade, J2a-M410, remained mainly in the regions of the Fertile Crescent that had sufficient rainfall to support a Neolithic farming economy. Although humidity levels fluctuated during the Holocene, the present climatic regime in Arabia was established 5000 years ago.16 Colonization of marginal habitats such as desert regions by a few founders plausibly results not only in reduced genetic diversity but also in reduced linguistic diversity, as evidenced by the broad geographical footprint of the Arabic language in the arid regions of the Middle East.

Although considerable sub-haplogroup diversification has been previously described within the J2-M172 clade,17 the occurrence of J1-M267 affiliated subtypes at frequencies exceeding a few percent has not yet been reported.18 Here, we present the phylogeographical and haplotype diversity data from a major sub-clade of J1-M267 that is defined by the J1e-Page08 (aka P58) SNP.19, 20

Hereafter, we shall refer to this major sub-clade as simply J1e.

Materials and methods

The nomenclature used for haplogroup labeling is in agreement with YCC conventions and a recent update.16 All samples designated as haplogroup J1 were determined to be derived at M267. Chromosomes labeled as J1* are J1(xJ1e). Our study comprises a total of 553 haplogroup J1 samples from 38 populations (Supplementary Table 1), of which 494 are J1e-derived and 59 are J1*. The majority of the samples were experimentally analyzed for the haplogroup J1e-defining SNP by either RFLP or DHPLC methodology, except for 55 samples reported as J1 among the Sudanese from Khartoum; Amhara from Addis Ababa, Ethiopia; and Iraqis from Nassiriya.18 These were inferred to belong to J1e based on companion YSTR haplotype data. J1e status was deduced when a haplotype had DYS388 ≥15 repeats and YCAII A, B allele sizes of either 19, 22 or 22, 22. The haplotype data used in our analyses are given in Supplementary Table 2. The following eight loci, DYS19, DYS389I, DYS389II, DYS390, DYS391, DYS392, DYS393 and DYS439, were used to estimate expansion times using the methodology described by Zhivotovsky et al.21 as modified according to Sengupta et al.17 A microsatellite evolutionary effective mutation rate of 6.9 × 10⁻⁴ was used. Networks were constructed by the median joining method using Network, where ɛ=0 and microsatellite loci were weighted in inverse proportion to the repeat variance observed in each haplogroup.22 For J1* chromosomes, the network included the following nine loci: DYS19, DYS388, DYS389I, DYS389II, DYS390, DYS391, DYS392, DYS393 and DYS439. With the exclusion of DYS388, the same eight STR loci were also used to construct a network for haplogroup J1e-affiliated chromosomes.
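The inference rule for the samples that were not genotyped directly can be written as a simple filter. The following is a minimal sketch of the two stated criteria only; the function name, the data layout and the four example haplotypes are hypothetical and are not taken from Supplementary Table 2.

```python
from typing import Tuple

# Allowed YCAII (A, B) allele-size pairs, as stated in the criteria above.
J1E_YCAII_PAIRS = {(19, 22), (22, 22)}


def infer_j1e(dys388: int, ycaii: Tuple[int, int]) -> bool:
    """Return True when a J1 haplotype meets both stated J1e criteria.

    dys388: repeat count at DYS388 (must be 15 or more).
    ycaii:  the two YCAII allele sizes, in either order.
    """
    pair = tuple(sorted(ycaii))  # make the check independent of allele order
    return dys388 >= 15 and pair in J1E_YCAII_PAIRS


# Hypothetical haplotypes, for illustration only: (DYS388, (YCAII a, YCAII b)).
samples = {
    "sample_A": (16, (22, 22)),  # passes both criteria
    "sample_B": (13, (19, 22)),  # fails the DYS388 criterion
    "sample_C": (15, (22, 19)),  # passes; allele order is irrelevant
    "sample_D": (15, (19, 21)),  # fails the YCAII criterion
}

calls = {name: infer_j1e(d388, yca) for name, (d388, yca) in samples.items()}
for name, is_j1e in calls.items():
    print(name, "J1e" if is_j1e else "not inferred as J1e")
```

Sorting the YCAII pair before the lookup is a design choice of this sketch, made so that the order in which the two alleles are recorded does not affect the call.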

Results and discussion

Figure 1a shows the geographical location of populations included in this study. J1* chromosomes have their maximal frequency in the Taurus and Zagros mountain regions of Eastern Anatolia, Northern Iraq and Western Iran (Figure 1c). The J1* chromosomes frequently appear in combination with the 12 or 13 repeat pattern at DYS388, whereas the J1e chromosomes almost always display 15 or more repeats. Therefore, the J1e SNP information supports the previous inference that J1 chromosomes linked with DYS388=13 repeats share a common ancestry.1 Network analysis of J1* chromosomes (Figure 2a) shows a bifurcating substructure. One cluster is associated with DYS388=15 and DYS390 >23 repeats and the other cluster with DYS388=13 repeats. The locale of highest J1* frequency occurs in the vicinity of eastern Anatolia (Figure 1c). Both J1* and J1e occur in Sudan and Ethiopia (Supplementary Table 1). Our data show that the YCAII 22-22 allele state is closely associated with J1e (Supplementary Table 2). Interestingly, in Ethiopia, all Cushitic Oromo and 29% of Semitic Amharic J1 chromosomes are J1*.

Figure 1

(a) Red symbols indicate the geographical locations of 36 populations analyzed. (b) Interpolated spatial contours of annual precipitation (mm) distribution. (c) Interpolated J1* frequency spatial distribution. (d) Interpolated J1e frequency spatial distribution. (e) Interpolated J1e mean haplotype variance spatial distribution. (f) Construed trajectories of J1e lineage spread episodes. In red are delineated the initial Holocene migrations from the Taurus/Zagros Mountains to the Arabian Peninsula. Shown with black arrows are the subsequent expansions of Arabic populations in Arabia beginning in the Bronze Age.

Figure 2

(a) Median-joining network for J1* using the nine-locus Y-STR haplotypes. Networks were weighted according to Qamar et al.22 Loci analyzed included DYS19, DYS388, DYS389I, DYS389II, DYS390, DYS391, DYS392, DYS393 and DYS439. (b) Median-joining network for eight-locus (excluding DYS388) Y-STR haplotypes for J1e.

Table 1 shows the average variance and expansion times of J1e with their linguistic and archeological correlates for those populations with five or more samples; the Assyrians of Syria, Iraq, Turkey and Iran were amalgamated into one group and the Arab populations of Qatar, UAE and Saudi Arabia were also combined. The mean variance across the 19 populations in Table 1 correlates significantly with latitude (r=0.36, P<0.035, two-tailed Kendall's τ) and nonsignificantly with longitude (r=0.02, NS). This result supports the hypothesis that J1e likely originated among the more northerly populations in Table 1 and spread southward into the Arabian Peninsula (Figure 1f). The high YSTR variance of J1e in Turks and Syrians (Table 1, Figure 1e) supports the inference of an origin of J1e in nearby eastern Anatolia. Moreover, the network analysis of J1e haplotypes (Figure 2b) shows that some of the populations with low diversity, such as Bedouins from Israel, Qatar, Sudan and UAE, are tightly clustered near high-frequency haplotypes, suggesting founder effects with star burst expansion in the Arabian Desert.
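For readers unfamiliar with the rank correlation used here, the following minimal sketch computes Kendall's τ (the tie-corrected τ-b form) from paired observations. The latitude and variance values are hypothetical placeholders chosen for illustration; they are not the values in Table 1, and whether the original analysis applied a tie correction is an assumption of this sketch.

```python
from itertools import combinations
from math import sqrt
from typing import Sequence


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b: (C - D) / sqrt((C + D + Tx) * (C + D + Ty)).

    C and D are the numbers of concordant and discordant pairs; Tx and Ty
    are the numbers of pairs tied only in x and only in y, respectively.
    """
    concordant = discordant = tied_x = tied_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: contributes to neither term
        if dx == 0:
            tied_x += 1
        elif dy == 0:
            tied_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = sqrt((concordant + discordant + tied_x)
                 * (concordant + discordant + tied_y))
    return (concordant - discordant) / denom


# Hypothetical data, for illustration only (not from Table 1).
latitude = [15, 24, 31, 33, 37]                 # degrees north
mean_variance = [0.10, 0.22, 0.18, 0.30, 0.35]  # mean YSTR variance

tau = kendall_tau_b(latitude, mean_variance)
print(f"Kendall tau-b = {tau:.2f}")
```

In this toy data set nine of the ten pairs are concordant and one is discordant, so τ is (9 − 1)/10. A positive τ of this kind is what would indicate variance rising with latitude.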

Table 1 J1e and J1* expansion time, mean YSTR variance based on DYS19, DYS390, DYS391, DYS392, DYS393, DYS389I, DYS389II and DYS439, linguistic and archeological correlates by population

The series of expansion times (Table 1) is also consistent with a subsequent Neolithic range expansion of J1e from a geographical zone including northeast Syria, northern Iraq and eastern Turkey toward Mediterranean Anatolia, the Ismaili of southern Syria, Jordan, Palestine and northern Egypt. Although there is a trend between the mean variances and the expansion time estimates, the latter do not uniformly increase with variance (Table 1), as some populations likely have more than one J1e founder. Support for this explanation comes from populations that carry two distinct varieties of YCAII chromosomes, namely 19, 22 and 22, 22, whereas those with low mean diversity typically carry only the 22, 22 class (Supplementary Table 2). A network analysis of J1e chromosomes (Figure 2b) also reflects situations of multiple founders.
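The relationship between STR variance and expansion time can be illustrated with a minimal sketch of a variance-based age estimate in the spirit of the approach cited in the Methods. Several details are assumptions made here for illustration and are not stated in the text: the founder haplotype is approximated by the per-locus median repeat count, the mutation rate of 6.9 × 10⁻⁴ is treated as per locus per generation, and a generation is taken as 25 years. The haplotypes are hypothetical.

```python
from statistics import median
from typing import Dict, List

MUTATION_RATE = 6.9e-4   # evolutionary effective rate quoted in the Methods
GENERATION_YEARS = 25    # ASSUMPTION: generation length is not stated in the text


def expansion_time_years(haplotypes: List[Dict[str, int]],
                         loci: List[str]) -> float:
    """Estimate expansion time from the average squared distance (ASD)
    between sampled haplotypes and an assumed founder haplotype."""
    per_locus_asd = []
    for locus in loci:
        counts = [h[locus] for h in haplotypes]
        founder = median(counts)  # ASSUMPTION: median approximates the founder
        asd = sum((c - founder) ** 2 for c in counts) / len(counts)
        per_locus_asd.append(asd)
    mean_asd = sum(per_locus_asd) / len(per_locus_asd)
    generations = mean_asd / MUTATION_RATE
    return generations * GENERATION_YEARS


# Hypothetical two-locus haplotypes, for illustration only.
toy_haplotypes = [
    {"DYS19": 14, "DYS390": 23},
    {"DYS19": 14, "DYS390": 23},
    {"DYS19": 14, "DYS390": 23},
    {"DYS19": 15, "DYS390": 23},
]

t = expansion_time_years(toy_haplotypes, ["DYS19", "DYS390"])
print(f"Estimated expansion time: {t:.0f} years")
```

Because the estimate is measured from a single assumed founder, a population that descends from two distinct founder haplotypes will show inflated variance relative to its true expansion time, which is one way the variance and time estimates can diverge.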

Although the haplogroup diversification within J1e remains incomplete, the somewhat rare J1e1-M368 provides an insight into the geographical origin of J1e. It has been reported both in the Black Sea region of Turkey1 and Dagestan in the northeast Caucasus.18 Furthermore, J1e1-M368 displays the YCAII 19-22 pattern. Although the haplogroup relationships of YCAII alleles are unstable, in the context of haplogroup J1 they suggest that the prevalent YCAII 22-22 variety may have evolved from a YCAII 19-22 ancestor.

Table 1 lists the current languages and the first millennium BCE Iron Age languages spoken in the geographical regions from which the samples were collected. Tracking back to the Iron Age, all the branches of the Central Semitic languages are represented – NW Semitic, Arabic and Old South Arabian in the Levantine and Yemeni sampling regions. The Assyrian samples and Iraqi Kurdish samples have been drawn from areas in Northern Mesopotamia speaking East Semitic languages at the time. The current data suggest an origin of J1e in the general area of eastern Turkey/northern Iraq associated with the Zarzian horizon,23, 24, 25 as they have similar early pre-agricultural expansions (16 kya, Table 1).

The timing and geographical distribution of J1e are representative of a demic expansion of agriculturalists and herder–hunters from the Pre-Pottery Neolithic B to the late Neolithic era.24, 26 The higher variances observed in Oman, Yemen and Ethiopia suggest sampling variability and/or demographic complexity associated with multiple founders and multiple migrations. The expansion time associated with Yemen is somewhat older (7000 BCE) and may reflect a migration of herders into southern Arabia.27 Finally, the more recent expansion times (Table 1) observed in Arabs from the Arabian Peninsula, Negev Bedouins and Sunni Arabs from Hama, Syria, are consistent with a subsequent Chalcolithic/Early Bronze Age (3000–5000 BCE) advance of J1e to the Arab populations of Arabia from near the early attested Arabic-speaking area of Tayma in north central Arabia28, 29 (Figure 1f).

A comparison of the mean annual rainfall and spatial frequency distribution of J1e (Figures 1b and d, respectively) indicates that J1e peaks in the arid regions of the Arabian Peninsula. We performed a nonparametric Mann–Whitney test to address whether the frequency of J1e is higher in arid regions (≤300 mm annual rainfall) than in regions with more rainfall in our sample set of African and Near Eastern populations. We found that the frequency of J1e was significantly greater in the arid than in the non-arid populations (P=0.0035). By combining all the arid populations (Supplementary Table 1) into one sample (n=16), we circumvented the details of the geographic frequency distribution, such that the J1e frequency pattern was examined primarily with regard to precipitation rather than geography, although the two are correlated.
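The logic of the Mann–Whitney comparison can be sketched as follows. This is an illustrative pure-Python version that computes the U statistic and an exact one-sided P value by enumerating every relabelling of the pooled values, which is feasible only for small samples. The frequencies are hypothetical and are not those of Supplementary Table 1, and the use of a one-sided exact test is an assumption of this sketch rather than a description of the original analysis.

```python
from itertools import combinations
from typing import Sequence


def mann_whitney_u(a: Sequence[float], b: Sequence[float]) -> float:
    """U statistic for sample a: the number of (a_i, b_j) pairs with
    a_i > b_j, counting ties as one half."""
    return sum(1.0 if ai > bj else 0.5 if ai == bj else 0.0
               for ai in a for bj in b)


def exact_one_sided_p(a: Sequence[float], b: Sequence[float]) -> float:
    """P(U >= observed U) when group labels are assigned at random."""
    pooled = list(a) + list(b)
    observed = mann_whitney_u(a, b)
    n = len(pooled)
    hits = total = 0
    for idx in combinations(range(n), len(a)):
        chosen = set(idx)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in range(n) if i not in chosen]
        total += 1
        if mann_whitney_u(group_a, group_b) >= observed:
            hits += 1
    return hits / total


# Hypothetical J1e frequencies, for illustration only.
arid = [0.60, 0.45, 0.70, 0.35]       # populations with <=300 mm rainfall
non_arid = [0.10, 0.20, 0.40, 0.05]   # populations with more rainfall

u = mann_whitney_u(arid, non_arid)
p = exact_one_sided_p(arid, non_arid)
print(f"U = {u}, one-sided exact P = {p:.4f}")
```

Because the test uses only the ordering of the frequencies, it makes no assumption about their distribution, which suits small samples of population frequencies.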

Although most post-Last Glacial Maximum recolonization events have a typically northward signature,30, 31 our J1e results provide an example of a southward spread during the early Holocene. Although J1e is one of the most frequent haplogroups in the region, haplogroup E-M123 also shows its highest frequency and haplotype diversity in regions of the Fertile Crescent, decreasing toward the Arabian Peninsula.1, 2, 6 This co-distribution pattern of Y-chromosome haplogroups J1e and E-M123 resembles mtDNA haplogroups J1b and (PreHV)1 distributions that also display low levels of diversity despite their high frequency in Saudi Arabia.32, 33

Although on a broad scale the haplogroup J1e frequency distribution and expansion times are consistent with the model that it tracks a possible expansion of Neolithic agro-pastoralists from the Fertile Crescent into the arid Arabian Peninsula, several caveats must be considered. First, the patchy distribution of J1e frequency in the Levant (Syria, Jordan, Israel and Palestine) may reflect the complex demographic dynamics of religion and ethnicity in the region. Second, even though the highest YSTR variance of J1e lineages is in eastern Anatolia, northern Iraq and northwest Iran, one cannot entirely rule out recent admixture as a contribution to the high variance among ethnic Assyrians.

A recent Bayesian analysis of Semitic languages supports an origin in the Levant 5750 years ago and subsequent arrival in the Horn of Africa from Arabia 2800 years ago,11 thus providing indirect support for our phylogenetic clock estimates. It is important to note that the glottochronological dates yield estimates for the break-up and expansion of the Proto-Semitic language. Proto-Semitic itself may have been spoken in a localized linguistic community for millennia before its bifurcation into the East and West Semitic branches. In summary, haplogroup J1e data suggest an advance of the Neolithic period agriculturalists/pastoralists into the arid regions of Arabia from the Fertile Crescent and support an association with a Semitic linguistic common denominator.14

Conflict of interest

The authors declare no conflict of interest.