Introduction

Located at the crossroads of Europe and the Middle East, the Armenian plateau has long served as both a recipient of and potential conduit for gene flow between the two regions. Although archaeological evidence for human as well as Neanderthal activity in Armenia during the Paleolithic era exists,1, 2 the Last Glacial Maximum (LGM) likely made permanent settlements of the region infeasible3 until the glacial recessions between 16 and 18 kya.4 While Mesolithic sites in Sasun are known,5 the improving climatic conditions during this period allowed the Armenian plateau to gradually transform into a region characterized by a bountiful water supply and a wealth of fertile plains.6 The transition from the LGM to a hospitable landscape that would support stable and detectable human occupation is expected to have progressed over millennia. These conditions as well as its proximity to the Fertile Crescent catalyzed the region's emergence as one of the earliest recipients of agriculture (∼8 kya) during the Neolithic Revolution.7, 8 In addition to the relatively early appearance of agriculture in Armenia, the Armenian highlands seem to have been instrumental in the dispersal of obsidian,9 leather footwear10 and viticulture,11 technologies that would later acculturate across the Near East and eventually enter Europe. The extent to which the Armenian gene pool has been structured and influenced by events during and since the Neolithic, however, remains unknown.

Presently, Armenians are characterized as a distinct ethnotype, speaking a single Indo-European language, Armenian. Linguistic analyses have found that Armenian represents one of the oldest living Indo-European languages and exhibits its greatest affinities with Greek and Balkan languages.12 With some linguists placing the origins of the Proto-Indo-European and Indo-European languages in either Anatolia or Transcaucasia,12 it has been proposed that Armenians represent close descendants of the ancestral Indo-European population and that subsequent migrations from Armenia into Greece were responsible for the language group's dispersal into Europe.13, 14 However, a lack of archaeological support for this notion has led to the alternative supposition that invasions from Balkan or Anatolian tribes15 introduced an Indo-European language into Armenia, resulting in the observed similarities between Armenian and the Southeastern European languages.16, 17

In efforts to gain a comprehensive understanding of the impact that complex historical migrations and events have had upon the genetic structure of populations, the human Y-chromosome has emerged as a highly effective tool.18 Prior examinations of the paternal lineages within Armenia have revealed population-expansion times corresponding to the Neolithic emergence of agriculture,19 as well as genetic affinities toward both Near Eastern and European populations,20, 21 results that are largely corroborated by mitochondrial DNA22 and polymorphic Alu insertion (PAI) studies.23

It should be noted, however, that the above-mentioned patrilineal studies are hindered by their utilization of a limited set of Y-chromosomal markers that severely restricted their ability to define phylogenetic relationships. In particular, none of the previous paternal investigations of Armenia resolved the predominantly European haplogroup R beyond the level of R1*(xR1a), which leaves the precise relationship of Armenians to Europeans ambiguous. In the current study, we perform, for the first time, a high resolution analysis of the Y-chromosome structure in four distinct Armenian populations that encompass the breadth of historical Armenia to investigate the role the region had in the Neolithic and Indo-European expansions, as well as finely assess the effects of Armenia's complex history upon its patrilineal genetic architecture.

Materials and methods

Sample collection and DNA isolation

Buccal swabs were acquired with informed consent from 413 unrelated individuals under the ethical guidelines asserted by the Florida International University Institutional Review Board. These individuals, collected from four distinct regions of the historical expanse of Armenia (Figure 1), the Ararat Valley (n=110), Gardman (n=96), Lake Van (n=103) and Sasun (n=104), were assessed to be ethnic Armenians for at least two generations. DNA was isolated with the Gentra Buccal Cell Kit using the protocol described in the manufacturer's specifications (Puregene, Gentra Systems, Minneapolis, MN, USA).

Figure 1

Geographic locations of the Armenian populations examined. The dashed line indicates the boundary of the greatest extent of historical Armenia.

Y-SNP and Y-STR genotyping

A total of 70 Y-chromosomal binary markers (Figure 2) were hierarchically examined for the purpose of categorizing each sample into particular Y-haplogroups that can be used to trace phylogenetic relationships. Genotypes were identified using standard PCR–RFLP, allele-specific PCR and the YAP PAI24, 25 procedures. Y-haplogroup nomenclature was assigned as in Karafet et al26 and Myres et al.27 To determine Y-STR haplotypes for individuals belonging to haplogroups R1b-M343, J2-M172 and T-M184, 17 loci (DYS19, DYS385 a/b, DYS389I/II, DYS390, DYS391, DYS392, DYS393, DYS437, DYS438, DYS439, DYS448, DYS456, DYS458, DYS635 and Y-GATA H4) were amplified using the AmpFlSTR Yfiler PCR amplification kit (Applied Biosystems). The number of repeats at each locus was determined by first separating the PCR products in an ABI Prism 3100 × 1 Genetic Analyzer (Applied Biosystems, Foster City, CA, USA) and then analyzing the fragments with the Genemapper v3.2 program. The size of the DYS389I locus was subtracted from that of the DYS389II locus for all analyses.
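
The DYS389 adjustment described above can be illustrated with a minimal sketch. The DYS389II amplicon spans the DYS389I repeat block, so the subtraction keeps the shared repeats from being counted twice. The function name and the repeat counts below are hypothetical and are not taken from the study data.

```python
def adjust_dys389(haplotype):
    """Return a copy of the haplotype in which DYS389II holds DYS389II minus DYS389I."""
    adjusted = dict(haplotype)  # copy, so the original genotype record is left untouched
    adjusted["DYS389II"] = haplotype["DYS389II"] - haplotype["DYS389I"]
    return adjusted


# Hypothetical repeat counts, for illustration only
example = {"DYS19": 14, "DYS389I": 13, "DYS389II": 29, "DYS390": 24}
adjusted = adjust_dys389(example)
```

Returning a copy rather than modifying the record in place keeps the raw genotype calls available for checking.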

Figure 2

Y-haplogroup phylogeography within Ararat Valley, Gardman, Lake Van and Sasun. The following markers were typed, but were found to be non-polymorphic: M84, M290, M286, P18, P37, M62, M365, M390, P56, M367, M369, M322, M68, M138, M158, M289, M318, M339, M340, M419, P81, M76, M357, M175, M458, M412, M320 and P77.

Statistical and phylogenetic analyses

A list of the 25 previously published populations employed in all the phylogenetic analyses, except the MDS based on haplogroup frequencies (see Supplementary Table 1 for the populations utilized in this analysis), is provided in Table 1. Pairwise genetic distances (Fst) as well as Nei's genetic diversity estimates were calculated with the Arlequin v3.5 program (Genetics and Biometry Laboratory, University of Geneva, Geneva, Switzerland).28 The frequencies of the complete set of detected haplogroups were used to determine pairwise Fst distances among the four Armenian populations. MICROSAT29 (Stanford University Medical Center, Stanford, CA, USA) was used to generate Rst distances from STR allelic frequencies in haplogroups R1b and J2 in Armenia, as well as populations containing appreciable frequencies of haplogroups R1b and J2. These distances were utilized to generate Multidimensional Scaling Analysis (MDS) plots using SPSS v18.0 software (SPSS, 2001, SPSS Inc., Chicago, IL, USA).
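
As an illustration of the diversity statistic mentioned above, Nei's unbiased gene diversity can be written as h = n/(n − 1) × (1 − Σ p_i²), where n is the sample size and p_i is the frequency of the i-th haplotype (or haplogroup). The sketch below is a minimal stand-alone version of that formula, not the Arlequin implementation; the function name and the sample labels are hypothetical.

```python
from collections import Counter


def nei_gene_diversity(observations):
    """Nei's unbiased gene diversity for a list of haplotype or haplogroup labels."""
    n = len(observations)
    if n < 2:
        raise ValueError("at least two observations are required")
    counts = Counter(observations)
    # Sum of squared relative frequencies of each distinct label
    sum_sq_freq = sum((count / n) ** 2 for count in counts.values())
    return n / (n - 1) * (1.0 - sum_sq_freq)


# Hypothetical labels, for illustration only
all_distinct = nei_gene_diversity(["h1", "h2", "h3", "h4"])
two_types = nei_gene_diversity(["h1", "h1", "h2", "h2"])
single_type = nei_gene_diversity(["h1", "h1", "h1"])
```

The statistic approaches 1 when every sampled chromosome carries a different haplotype and equals 0 when all chromosomes are identical.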

Table 1 Populations analyzed

Network diagrams and time estimations

The phylogenetic relationships of the eight-locus haplotypes of all individuals within haplogroups R1b, J2 and T were ascertained by creation of median-joining (MJ) network diagrams30 using the NETWORK 4.5.1.6 software (Fluxus Engineering, Clare, UK). An additional network projection that included only individuals who were found to belong in either haplogroup R1b1b*-M269 or R1b1b1*-L23 was generated to attempt to discern the genetic relationship between Armenians and Europeans with the ancestral M412 allele. Haplotypes were created from the aforementioned set of eight loci available for each reference population and locus weights were assigned as described previously.31

Expansion times for haplogroups R1b, J2 and T were estimated from the eight-locus STR dataset using the method described in Zhivotovsky et al32 and modified by Sengupta and collaborators.33 Genealogical mutation rates calculated from family-based studies defined for each locus in Ravid-Amir and Rosset34 and Goedbloed et al,35 with an intergeneration time of 25 years, were utilized. Owing to the marginal differences in the time estimates generated using both family studies, we provide, throughout the narrative, the expansion values based on Ravid-Amir and Rosset34 mutation rates. We have recalculated the time estimates for the reference populations with the genealogical mutation rate and the Zhivotovsky et al32 method to allow direct comparisons of dates. Because of the limited availability of haplotypes within haplogroup T for several of the reference populations, time estimates were calculated using the Levant and Turkey as single groups. Furthermore, because of the elevated levels of haplogroup T in Sasun relative to the other three collections, two dates were calculated from Armenian populations, one for Ararat Valley, Gardman and Lake Van, collectively, and a second for Sasun. In addition, spatial gradient maps were generated to illustrate haplogroup frequency and STR allelic variance clines in both haplogroups R1b and J2 using the program Surfer v.9.0.343 (http://www.goldensoftware.com).
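
The variance-based logic behind such expansion-time estimates can be sketched as follows. This is a simplified illustration under stated assumptions, not the exact computation used in this study: the founder haplotype is approximated by the per-locus median, the average squared difference (ASD) from that founder is averaged over loci and divided by the mean per-generation mutation rate, and the result is scaled by the 25-year intergeneration time. The function name, locus labels, repeat counts and mutation rates below are hypothetical placeholders.

```python
from statistics import mean, median


def expansion_time_years(haplotypes, mutation_rates, generation_years=25):
    """Simplified ASD-based expansion time (in years) from STR haplotypes.

    haplotypes: list of dicts mapping locus name -> repeat count
    mutation_rates: dict mapping locus name -> mutations per generation
    """
    loci = list(mutation_rates)
    # Assumption: the founder haplotype is approximated by the per-locus median
    founder = {locus: median(h[locus] for h in haplotypes) for locus in loci}
    # Average squared difference from the founder, computed locus by locus
    asd_per_locus = [
        mean((h[locus] - founder[locus]) ** 2 for h in haplotypes)
        for locus in loci
    ]
    generations = mean(asd_per_locus) / mean(mutation_rates.values())
    return generations * generation_years


# Hypothetical two-locus data and placeholder mutation rates, for illustration only
sample = [
    {"locusA": 10, "locusB": 12},
    {"locusA": 10, "locusB": 12},
    {"locusA": 11, "locusB": 12},
    {"locusA": 9, "locusB": 12},
]
rates = {"locusA": 0.0025, "locusB": 0.0025}
estimate = expansion_time_years(sample, rates)
```

Because the estimate scales inversely with the assumed mutation rate, the choice between rate calibrations shifts every date proportionally, which is why relative comparisons between populations are more robust than the absolute values.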

Results

Y-haplogroup distribution

For all four Armenian populations analyzed in this study, the most prevalent major haplogroups (Figure 2) are R-M207 (38%, 36%, 33% and 34% in Ararat Valley, Gardman, Lake Van and Sasun, respectively) and J-M304 (38%, 36%, 43% and 27%, respectively). Of the lineages within haplogroup R, the largely Near Eastern27 R1b1b1*-L23 predominates in Ararat Valley, Gardman and Lake Van (33%, 31% and 32%, respectively). Furthermore, in Ararat Valley we find five individuals belonging to the paraphyletic haplogroup R1b1b*-M269. The Sasun collection, meanwhile, contains comparable distributions of haplogroups R1b1b1*-L23 (15%) and R2-M124 (17%). It should be noted that only low frequencies of haplogroup R1a1*-M198, which has been associated with the Indo-Aryan expansions,36 were observed in Ararat Valley (0.9%), Gardman (5.2%) and Sasun (0.9%).

Within haplogroup J-M304, the majority of the collected Armenians belong to the J2a-M410 clade, with the most common lineage being J2a*-M410 (9%, 15%, 14% and 10% in Ararat Valley, Gardman, Lake Van and Sasun, respectively), followed by J2a2*-M67 (4%, 7%, 7% and 8%) and J2a2a-M92 (9%, 7%, 1% and 0%), each of which is largely restricted to Near and Middle Eastern populations.37 Of the remaining J-M304 lineages, appreciable frequencies of haplogroup J1-M267 are found across all four sampled Armenian populations (14.5%, 4.2%, 13.5% and 9.6% in Ararat Valley, Gardman, Lake Van and Sasun, respectively), whereas minimal levels of the South European38, 39 haplogroup J2b-M12 are found only in Ararat Valley (0.9%), Gardman (3.1%) and Lake Van (1.9%).

In each of the examined Armenian populations, we witness moderate levels of haplogroups T-M184, G-M201 and E1b1b1-M35. Interestingly, haplogroup T-M184, which is relatively rare in other Near Eastern populations, as well as in three of the Armenian collections tested here, represents the most prominent lineage in Sasun, comprising 20.1% of the samples. The presence of this haplogroup in Ararat Valley, Gardman and Lake Van, by contrast, is more limited, composing only 3.6%, 6.3% and 3.9%, respectively, of the individuals from those collections. Haplogroup G-M201 (10.9%, 6.3%, 7.8% and 12.5% in Ararat Valley, Gardman, Lake Van and Sasun, respectively), on the other hand, is fairly evenly distributed across the four Armenian populations, with sublineage G2a-P15 comprising the vast majority of these chromosomes. Similarly, the E1b1b1-M35 clade (5.5%, 4.2%, 7.8% and 2.9% in Ararat Valley, Gardman, Lake Van and Sasun, respectively) in the four Armenian populations consists almost entirely of haplogroup E1b1b1c1*-M34. However, a small number of individuals from Lake Van belong to the E1b1b1a*-M78 (0.9%), E1b1b1a2-V13 (1.9%) and E1b1b1a3-V22 (0.9%) branches that have been implicated as signals of Greek influence.40

Pairwise Fst distances between the four Armenian populations based on haplogroup frequencies, as well as their corresponding P-values, are provided in Table 2. These results indicate that the genetic distances separating Ararat Valley, Gardman and Lake Van are not statistically significant at the P≤0.001 level. Sasun, however, exhibits statistically significant divergence from the remaining Armenian populations, most likely as the result of the prominence in Sasun of lineages (T-M184 and R2-M124) found at substantially lower frequencies in Ararat Valley, Gardman and Lake Van. An MDS plot based on Fst distances generated from Y-haplogroup frequencies is provided in Supplementary Figure 1. However, conclusions yielded from this plot must be tempered, as the low haplogroup resolution generates artificial associations that may not exist with an increased number of markers (ie, the observed affinities between Armenia and Europe because of their high levels of R, despite the lack, in Armenia, of the M412-derived allele that predominates in European populations).

Table 2 Fst distances based on haplogroup frequencies

Haplogroup R1b

Haplotypes for the 120 Armenians belonging to haplogroup R1b-M343 are available in Supplementary Table 2. In addition, the number of observed haplotypes, gene diversity and allelic variance within haplogroup R1b are provided in Table 3. An MDS analysis of Rst distances calculated from individuals possessing the derived R1b-M343 allele is illustrated in Figure 3a. Despite the elevated frequencies of haplogroup R1*(xR1a) in Armenia, likely driving the European influences seen in the MDS plot based on haplogroup frequencies (Supplementary Figure 1), Armenian populations group further from the European cluster than do most of the Near Eastern collections (Figure 3a). Instead, Gardman and Lake Van segregate closest to the Eastern Turkey population, whereas Sasun and Ararat Valley appear to act as outliers, exhibiting substantial separation from the other two Armenian groups, as well as the reference populations.

Table 3 Haplotype diversity, variance and time estimations for Haplogroup R1b
Figure 3

MDS analyses based on Rst distances calculated from STR allelic frequencies from eight loci of individuals within haplogroups (a) R1b (Stress=0.14116) and (b) J2 (Stress=0.20418). Stress values <0.23 were considered acceptable as determined by Sturrock and Rocha.56
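
For readers unfamiliar with the stress statistic, the sketch below computes Kruskal's stress (formula 1) for a given low-dimensional configuration. It is a simplified illustration that compares the plotted distances directly with the input dissimilarities, as in metric MDS; it is not the SPSS procedure, which may first transform the dissimilarities. The function name, coordinates and distance values are hypothetical.

```python
import math


def kruskal_stress1(dissimilarities, coords):
    """Kruskal's stress-1 between input dissimilarities and plotted distances.

    dissimilarities: square symmetric matrix (list of lists)
    coords: list of coordinate tuples, one per population
    """
    numerator = 0.0
    denominator = 0.0
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            plotted = math.dist(coords[i], coords[j])  # Euclidean distance in the plot
            numerator += (plotted - dissimilarities[i][j]) ** 2
            denominator += plotted ** 2
    return math.sqrt(numerator / denominator)


# Hypothetical three-population example, for illustration only
coords = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
exact = [[0, 3, 4], [3, 0, 5], [4, 5, 0]]
distorted = [[0, 3, 4], [3, 0, 6], [4, 6, 0]]
stress_exact = kruskal_stress1(exact, coords)
stress_distorted = kruskal_stress1(distorted, coords)
```

A stress of zero means the plot reproduces the input distances exactly; larger values indicate that the two-dimensional layout distorts the underlying distances.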

An MJ network diagram generated from the Y-STR haplotypes of all R1b-M343 individuals is provided in Supplementary Figure 2a. Haplotypes from each of the four Armenian populations are found across a wide distribution of clusters and are commonly associated with haplotypes from Near Eastern populations. However, we note very limited haplotype sharing among Armenian and European populations, an observation congruent with the MDS plot (Figure 3a). Similarly, the network based only on individuals belonging to the R1b1b*-M269 and R1b1b1*-L23 lineages (Supplementary Figure 2b) exhibits a wide distribution of Armenian haplotypes and does not identify a segregation between individuals of Armenian and East European descent, which is not inconsistent with older gene flow between the two regions. Expansion time estimates for haplogroup R1b-M343 are provided in Table 3. Overall, the dates calculated using the genealogical mutation rates from Ravid-Amir and Rosset34 and Goedbloed et al35 are comparable, with the latter yielding slightly older dates and greater standard errors. These estimates seem to indicate that the Armenian expansion times are intermediate to those of the Levant and Europe; a cline illustrated in a spatial gradient map of mean variance within R1b-M343 (Figure 4). In addition, the dates generated from this analysis appear to agree with late to early Neolithic expansions for R1b-M343 across the four Armenian collections and most reference populations. However, owing to the contentions associated with the current calibrations of the Y-STR mutation rates,32, 34, 35, 41 as well as the limitations of the assumptions utilized by the methodologies for time estimations, the absolute dates generated in this study should only be taken as rough estimates of upper bounds.

Figure 4

Spatial gradient maps illustrating clines in frequency and mean STR allelic variance (eight loci) within haplogroups R1b and J2.

Haplogroup J2

Supplementary Table 3 provides the haplotypes for the 105 J2-M172-derived individuals. An MDS plot based on Rst distances from individuals under the J2 lineages (Figure 3b) illustrates Gardman, Sasun and Lake Van plotting among the Turkish, Iranian and Levantine populations, while demonstrating segregation away from Crete and Greece. The phylogenetic relationships of haplotypes within haplogroup J2 are displayed in an MJ network in Supplementary Figure 3. As in the R1b-M343 network, the low genetic diversity of Ararat Valley and Sasun is clearly observed in the partitioning of samples from these populations into only a few haplotypes. The relative expansion times for haplogroup J2-M172 (Table 4) generally correspond with those yielded for R1b-M343, with the exception of Greece and Crete, which, unlike haplogroup R1b-M343, are slightly older than the dates yielded for several of the Near Eastern groups as well as the four Armenian populations.

Table 4 Haplotype diversity, variance and time estimations for Haplogroup J2

Haplogroup T

The haplotypes of the 35 individuals belonging to Haplogroup T are provided in Supplementary Table 4. The distribution of haplotypes within Haplogroup T-M184 illustrated in an MJ network (Supplementary Figure 4) depicts a positioning of clusters composed almost entirely of Levantine individuals in the center of the diagram, with all other groups emanating from this area. This potentially points toward an origin of these haplotypes in the Levant, a supposition that is further supported by the observation of high genetic diversity in the network among Levantine individuals within haplogroup T-M184. Time estimates for T-M184 individuals (Supplementary Table 5) corroborate the supposition that the lineage originated in the Levant and entered the Armenian Plateau at a later time, as the dates generated from Levantine (6.2±1.8 kya using Ravid-Amir and Rosset34) and Iranian (5.9±1.3 kya) individuals are older than those yielded from Turkey (5.7±1.2 kya), Sasun (5.0±1.2 kya) and the remaining three Armenian populations (average 3.9±1.0 kya).

Discussion

Genetic support for Neolithic origins of the Armenian populations

The results of this study suggest that the majority of Armenian Y-chromosomes belong to lineages believed to have originated and expanded during or following the Neolithic, including E1b1b1c-M123, G-M201, J1-M267, J2-M172 and R1b1b1-L23.27, 38, 42, 43, 44, 45, 39 Previous investigations have found some of these haplogroups, including J1-M267 and J2-M172, to illustrate patterns of distributions that reflect the spread of agriculture and domestication from the Fertile Crescent,37, 45 and recently, the distribution of R1b1b-M269 haplotypes across the Near East and Europe has been proposed to be best explained by a Neolithic dispersal.44 Of particular prominence in Armenia are haplogroups R1b-M343 and J2-M172, which are detected in Ararat Valley, Gardman and Lake Van at frequencies higher than those observed in any of the Near Eastern populations analyzed in this study. STR haplotypes within these two lineages exhibit clear genetic affinities with individuals from the Near East, particularly the Levantine populations of Jordan and Lebanon as demonstrated in the MDS plots (Figures 3a and b) and network diagrams (Supplementary Figures 2 and 3). These affiliations suggest gene flow between the Levant and Armenia, which, given the time estimates calculated, likely coincided with the expansion of agriculturalists. Examination of mean haplotype variance and expansion times for R1b-M343 and J2-M172 revealed less variance and slightly younger dates in Armenia relative to the surrounding regions. Therefore, our data likely reflect a directionality of these migrations from an origin in the Levant before reaching the Armenian highlands. However, more recent gene flow, possibly during the numerous imperial expansions from Assyria, Persia or even Europe, cannot be fully excluded as potential causes for the observed distribution patterns of genetic diversity. Yet, it is difficult to imagine such events leading to the genetic homogeneity observed among Ararat Valley, Gardman and Lake Van.

Given the inhospitable climatic conditions4 and paucity of archaeological remains3 in Armenia from the millennia preceding the Neolithic, the predominance of Neolithic Y-chromosomes in Armenia suggests that the region was sparsely settled before the arrival of early farmers.5 Settlements during the Mesolithic, such as those observed near Sasun,4 were likely only fleeting. We envision a hypothetical, yet compelling, scenario in which Neolithic agriculturalists from the Levant occupied a vacant Armenian Plateau. This is reflected in the contour plots in Figure 4, where variance estimates indicate a demic decreasing gradient from the Levant toward Armenia, while frequencies illustrate an inverse cline, supporting the idea that the migrants from the Levant constitute a great proportion of the Armenian population. According to this contention, most Paleolithic genetic signals currently detected in the Armenian highlands represent influx of chromosomes from continuously settled areas and not vestiges from older occupations. This is supported by time estimates for Haplogroup T-M184, a clade believed to have originated in the Near East during the Paleolithic,46 as we observe dates in Armenia (∼12–13 kya) lower than those in other regions of the Near East, including Iran and the Levant (∼20 kya). Such a disparity suggests the Haplogroup T lineages presently observed in Armenia were introduced by migrations that are more recent than those that carried Haplogroup T to the Levant and Iran. Given the difficulties associated with absolute dating, it is also possible that these markers were carried by migrants entering Armenia during an even later era, such as the Bronze Age if the T lineage within genetic contributors has been lost to genetic drift.

Limited evidence for Indo-European genetic associations

Shortly after the arrival of early farmers in Armenia and Anatolia (8 kya), agriculture spread to Greece and the Balkans, before rapidly expanding across Europe.47 Furthermore, the classification of Armenian as an old Indo-European language with similarities to the ancestral Proto-Indo-European languages has led to the supposition that agriculturalists migrating from Armenia into Europe were responsible for the establishment of Indo-European languages in the continent.13, 14 However, despite the close linguistic relationship between Armenians and the Indo-European speaking populations of Europe,12 we see little genetic support for this claim. The derived M412 allele, which is found in nearly all haplogroup R1b1b1*-L23 chromosomes in Europe,27 is absent in the sampled Armenians, which also exhibit a scarcity of haplotype sharing with Europeans, suggesting a limited role for Armenians in the introduction of R1b into Europe.

Several authors have proposed that the Indo-European language presently spoken by Armenians arose during the Bronze Age, when Indo-European speaking tribes from the Balkans and Greece invaded Anatolia and Transcaucasia, leading to the subsequent spread of their culture and language.16, 17 In this study, we have detected a number of lineages that are prominent in the Balkans (I2*, I2b*, J2b1 and J2b2) at low levels throughout Ararat Valley, Gardman and Lake Van, the latter of which also contains haplogroups commonly associated with Bronze Age Greece (ie, J2a8-M319 (4.9%), and E1b1b1a-M78 and its sublineages (3.9%)). While this may suggest genetic input from early Greek or Phrygian tribes, it is also possible that these low levels of Balkan lineages arrived in Armenia at a later time, such as during one of the many incursions into the area during the reign of the Macedonian, Roman and Byzantine empires. It should be noted that these results only reflect the paternal history of Armenia and studies on a maternal or gender-neutral system may reveal distinct conclusions.

Although the Armenian paternal gene pool exhibits limited genetic affinities with modern European Y-chromosomes, a paleogenetic study examining the gene pool of Neolithic farmers in Europe (∼6 kya) found that these fossils displayed greater genetic similarity with individuals from the modern Near East than with those from modern Europe.48 These ancient individuals were characterized by haplogroups G2a and F*(xGHIJK), which, in the present day gene pool, are mostly restricted to the Near East and are not prevalent in modern Europeans. These results suggest that the genetic profile currently observed throughout Europe potentially originated from migrations that took place subsequent to the Neolithic era. Such a scenario may explain the lack of segregation separating Armenians and Europeans that was observed in the network containing only the older lineages, R1b1b*-M269 and R1b1b1*-L23, as these haplogroups may represent remnants of an older European population. As a result, comparisons of the paternal component of Armenians and a larger dataset of ancient European samples may illuminate the nature of the migrations into Europe with greater fidelity.

Conclusion

Our results indicate a prevalence of Neolithic paternal chromosomes that are associated with the Agricultural Revolution, namely E1b1b1c-M123, G-M201, J1-M267, J2a-M410 and R1b1b1*-L23, which collectively comprise 77% (58% in Sasun and an average of 84% in Ararat Valley, Gardman and Lake Van) of the observed paternal lineages in the Armenian Plateau. Furthermore, Y-STR variance and haplotype distributions suggest that these lineages were likely introduced into Armenia from the Levant. However, later migrations, such as from Armenia to Europe, do not appear to have been associated with any paternal gene flow.