Introduction

The male genetic landscape of the European continent has been shown to be clinal and influenced primarily by geography rather than by language.1 One of the most outstanding phenomena in the Y-chromosomal diversity in Europe concerns the population of Poland, which reveals geographic homogeneity of Y-chromosomal lineages in spite of the relatively large geographic area occupied by the Polish state.2 Moreover, a sharp genetic border has been identified between paternal lineages of neighbouring Poland and Germany, which strictly follows the political border between the two countries.3 Massive human resettlements during and shortly after World War II (WWII), involving millions of Poles and Germans, have been proposed as an explanation for the observed phenomena.2, 3 Under this explanation, the local Polish populations formed after the early Slavic migrations may have displayed genetic heterogeneity before the war owing to genetic drift and/or gene flow with neighbouring populations. Alternatively, it has been suggested that the homogeneity of Polish paternal lineages already existed before the war owing to a common genetic substrate inherited from the ancestral Slavic population after the Slavs’ early medieval expansion in Europe.2

From the linguistic point of view, western Slavic dialects are classified as Czech/Slovak, Lusatian and Lekhitic; the Lekhitic branch is further divided into Polish, Pomeranian and Polabian.4 Nowadays, among the western Slavs, only Polish and Czech/Slovak dialects have evolved into fully viable languages with millions of speakers. Lusatian is spoken by 66 000 Sorbs inhabiting southeastern Germany, down from 166 000 speakers in the late 19th century.5 Present-day Pomeranian comprises 53 000 speakers of Kashubian in northern Poland,6 although roughly half a million people in Poland claim Kashubian and half Kashubian ancestry.7 While Slavists classify Kashubian as a separate Slavic language,4 the vast majority of Kashubes declare Polish ethnicity.6 Polabian was spoken until the 18th century in what is now northeastern Germany.8 The Polish linguistic area is further subdivided into four dialectal groups, roughly corresponding to early Slavic tribal division: Greater Polish, Lesser Polish, Silesian and the most linguistically divergent Masovian.9

There exists an opinion among academics that ‘the Slavic ethnogenesis remains a major, if not the most important, topic in the historiography of Eastern Europe’.10 Most of the current knowledge on this subject results from indirect evidence based on linguistics, archaeology and anthropology and, more recently, molecular genetics.11 The changes seen in the 5th–6th centuries in eastern Europe are explained either in terms of a demographic expansion of the Slavic people, carrying with them their genes, customs and language, or as a primarily linguistic spread with only a minor contribution of migration.12

We used high-resolution typing of Y-chromosomal binary and microsatellite markers, first, to test for male genetic structure in the Polish population before the massive human resettlements of the mid-20th century and, second, to verify whether the present-day genetic differentiation between Polish and German paternal lineages is a direct consequence of WWII or has instead resulted from a genetic barrier between peoples with distinct linguistic backgrounds. The study further addresses the origin of the expansion of the Slavic language in early medieval Europe. For the purpose of our investigation, we sampled three pre-WWII Polish regional populations, three modern German populations (including the Slavic-speaking Sorbs) and a modern population of Slovakia.

Materials and methods

A total of 1156 individuals were analysed in the present study, including 520 unrelated males descending directly from pre-WWII native inhabitants of three distinct ethnolinguistic regions of Poland: Kaszuby (Kashubian-speaking region, n=204), Kociewie (Greater Polish-speaking region, n=158) and Kurpie (Masovian-speaking region, n=158). Inhabitants of the Kurpie region trace their origin to Masovian peasants who since the 16th century colonised forests between Masovia and Prussia, and were subjected to some degree of geographic and cultural isolation.9 The Kashubian samples were additionally assigned to three different dialects:9 northern (n=70), central (n=93) and southern (n=41). As genetic distances revealed the three Kashubian subpopulations to be genetically indistinguishable (data not shown), they were treated in many subsequent analyses as one population. Only individuals whose paternal-line ancestors were born in villages and had inhabited the studied areas for at least three generations were selected for the study. In addition, a sample set from Germany comprised Sorbs from Lusatia (Upper Sorbian speakers, n=123) and Germans from Mecklenburg (northeastern Germany, n=131) and western Bavaria (southwestern Germany, n=218). Finally, DNA samples from western Slovakia (n=164), used previously in a comprehensive analysis of Y-STR variation in the Slavic populations,11 were also included in the study. The studied populations and their linguistic background are summarised in Table 1, while their geographic locations on an ethnolinguistic map of central Europe in the early 20th century are shown in Supplementary Figure S1.

Table 1 Linguistic affiliations, Y-STR MPD and WIMP values (±SD), and surname distributions for the analysed populations

Two multiplex PCRs were utilised to genotype a total of 19 Y-STRs, including 17 STRs present in the commercially available AmpFlSTR Yfiler PCR Amplification Kit (Applied Biosystems, Foster City, CA, USA). The second multiplex comprised two additional Y-STRs: DYS388 and DYS426, as well as six biallelic markers, displaying amplified fragment length polymorphism: A-M91, BT-M139, B-M60, M-M186, O-M175 and R-M17.13 As the Yfiler kit amplifies two DYS385 loci simultaneously avoiding their discrimination, DYS385 was excluded from all the analyses performed, providing a total of 17 Y-STRs (including DYS388 and DYS426) for inferences. Other Y-SNPs were genotyped individually with the use of pre-designed TaqMan assays with previously published primer sequences.14 Their phylogenetic relationship is shown in Figure 1.

Figure 1

Phylogenetic relationship and frequencies of Y-chromosomal haplogroups in the studied populations. Ka Kaszuby; Ko Kociewie; Ku Kurpie; Lu Lusatia; Sl Slovakia; Me Mecklenburg; Ba Bavaria. (1) R-M17-derived samples with unknown M458 status owing to a persistent lack of PCR product, which most likely resulted from deletion of the M458 locus, located in very close proximity to the DYS448 marker (independent deletions of DYS448 have been described within different haplogroups45 and two out of the three samples with unknown R-M458 genotypes possess DYS448 null alleles).

Observed haplogroup frequencies were employed to calculate a matrix of pairwise FST values. Y-STR haplotypes were used to obtain ΦST and RST molecular distances. Calculations of genetic distances, estimations of corresponding P-values based on 10 000 permutations and analysis of molecular variance (AMOVA) were performed with the use of Arlequin 3.1 software.15 In order to thoroughly explore the Y chromosome distribution in the Polish population before and after WWII, our data were compared with 7-STR haplotypes published for a pre-WWII southern Polish population from the Lesser Polish-speaking regions of Podhale and Sądecczyzna (n=140)16 and for a number of modern Polish populations,16, 17, 18 including Kaszuby (n=142) and Podhale and Sądecczyzna (n=226). Multidimensional scaling (MDS) based on linearised distances19 was carried out with the use of STATISTICA 9.1 software (StatSoft, Tulsa, OK, USA). Network 4.6 software (Fluxus Technology, Clare, UK) was applied to build a median-joining network20 of Y-STR haplotypes with a maximum parsimony option.21 Mean pairwise differences (MPDs) within populations based on the 17-STR haplotypes and the weighted mean intralineage MPDs (WIMPs) were calculated as previously described.22 STR variation within chosen haplogroups was assessed by genetic variance (VP)23 and by the average squared difference in the number of repeats between all chromosomes and a median haplotype, averaged over microsatellite loci (ASD0).24
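The two within-population diversity measures named above can be illustrated with a minimal Python sketch. It is not the procedure of refs 22 or 24: it assumes that MPD counts the loci at which two haplotypes differ, and that the median haplotype is built from the per-locus lower median so that it always carries an observed allele. The function names and the three-locus sample are hypothetical.

```python
from itertools import combinations
from statistics import median_low


def mean_pairwise_differences(haplotypes):
    """Average number of loci at which two haplotypes differ, over all pairs.

    Assumption: a 'difference' is any mismatch at a locus, regardless of size.
    """
    pairs = list(combinations(haplotypes, 2))
    if not pairs:
        raise ValueError("need at least two haplotypes")
    total = sum(sum(a != b for a, b in zip(h1, h2)) for h1, h2 in pairs)
    return total / len(pairs)


def median_haplotype(haplotypes):
    """Per-locus lower median of repeat counts (an assumed convention)."""
    return tuple(median_low(locus) for locus in zip(*haplotypes))


def asd0(haplotypes):
    """Squared repeat difference from the median haplotype, averaged over
    chromosomes within each locus and then over loci."""
    med = median_haplotype(haplotypes)
    per_locus = [
        sum((h[i] - med[i]) ** 2 for h in haplotypes) / len(haplotypes)
        for i in range(len(med))
    ]
    return sum(per_locus) / len(per_locus)


# Hypothetical three-locus haplotypes (repeat counts), for illustration only.
sample = [(13, 30, 25), (13, 31, 25), (14, 30, 25), (13, 30, 27)]
mpd = mean_pairwise_differences(sample)  # 9 mismatches over 6 pairs
asd = asd0(sample)                       # per-locus means 0.25, 0.25, 1.0
```

With real data each tuple would hold the 17 Y-STR repeat counts of one chromosome, restricted to a single haplogroup when ASD0 is computed.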

The pre-WWII Polish samples were additionally divided into three subgroups, depending on surnames of the tested individuals. The first group comprised individuals carrying surnames with roots revealing Slavic/eastern European etymology or origin. Accordingly, males with surname roots indicating German/western European etymology or origin were included in the second group. The third group contained surnames with unclear or hybrid etymology. For each surname, the assignment was based on linguistic analysis provided in etymological dictionaries.25, 26, 27

BATWING28 was used to assess time of demographic expansion and split of the populations of Kaszuby and Lusatia. Time of start of demographic expansion, growth rate and time of population split were estimated using a model of exponential growth from a constant-size ancestral population. Observed mutation rates for each marker were used in the analysis.29 Y-STR mutation data published in the Y Chromosome Haplotype Reference Database30 and in the literature29, 31 were used to set mutation rate priors as provided in Supplementary Table S1. An initial effective population size and growth rate were given priors of gamma(1.1,0.0001) and gamma(1.01,1), respectively, in order to cover very wide ranges of possible values.32 Maximally uninformative uniform priors were set for dates of the expansion start and population split. SNP information was integrated for the phylogenetic reconstruction, but it was not considered for posterior estimates. A total of 10 million Markov chain Monte Carlo (MCMC) samples were collected: the first 5 million were rejected as burn-in and the remaining 5 million were used for inference. BATWING convergence was assessed from two independent runs with different seeds with the use of Gelman and Rubin’s convergence diagnostic available in the CODA package for R.33, 34 In order to put the BATWING results in a historical time scale, a male generation interval of 31 years35 was used.
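The convergence check described above can be sketched in Python as follows. This is the basic potential scale reduction factor of Gelman and Rubin, without the degrees-of-freedom correction that the CODA implementation may apply, so its values need not match CODA output exactly. The chains, function names and the helper converting generations to years are hypothetical; only the 31-year generation interval comes from the text.

```python
from statistics import mean, variance


def gelman_rubin(chains):
    """Basic potential scale reduction factor for equal-length chains.

    Values close to 1 suggest that the chains sample the same distribution.
    """
    m = len(chains)
    n = len(chains[0])
    if m < 2 or n < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least two chains of equal length >= 2")
    within = mean([variance(c) for c in chains])         # W: mean within-chain variance
    between = n * variance([mean(c) for c in chains])    # B: between-chain variance
    pooled = (n - 1) / n * within + between / n          # pooled variance estimate
    return (pooled / within) ** 0.5


def generations_to_years(generations, interval=31):
    """Convert a time in generations to years (31-year male generation)."""
    return generations * interval


# Hypothetical chains: two that agree, and one shifted far away.
chain_a = [0.0, 1.0] * 50
chain_b = [1.0, 0.0] * 50
chain_c = [x + 10.0 for x in chain_a]

r_agree = gelman_rubin([chain_a, chain_b])
r_disagree = gelman_rubin([chain_a, chain_c])
```

In practice the diagnostic would be computed separately for each BATWING parameter of interest, using the post-burn-in samples of the two independent runs.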

Populations speaking Sorbian and Kashubian, linguistically the most closely related to extinct Slavic dialects spoken in the past in present-day eastern Germany, were used to assess Slavic ancestry in the eastern German Y-chromosomal pool. In addition, German admixture was assessed in genetic outliers detected in the MDS analysis, that is, the Sorbs and Kashubes, with the Greater Polish-speaking population of Kociewie as the parental population (the Greater Polish dialects directly neighbour the Kaszuby region and share linguistic similarities with the Lusatian dialects9). For haplogroup data, genetic admixture estimators based on allele frequencies were assessed. An mR estimator directly comparing haplogroup frequencies was computed with the use of Admix 2.0.36 A maximum likelihood approach-based mW estimator considering the effect of genetic drift in admixed and parental populations was obtained with the aid of Leadmix software.37 As the overwhelming majority of Y-STR haplotypes were singletons specific to only one population, in the case of STR data, an mY estimator taking into account molecular distances between haplotypes rather than haplotype frequencies was computed with the use of Admix 2.0. In order to eliminate likely haplotype homoplasy, SNP phylogeny was integrated into STR information, weighting biallelic mutations 1000-fold higher than STR mutations.38 The molecular relationship between haplotypes was defined as the sum of squared differences in allele sizes.38
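The molecular distance described in the last two sentences can be illustrated with a short Python sketch. The text states only the 1000-fold weighting and the sum of squared allele-size differences; how the two are combined inside Admix 2.0 is not given here, so the additive form below is an assumption, as are the function name and the example genotypes.

```python
SNP_WEIGHT = 1000  # weighting stated in the text; the additive combination is assumed


def haplotype_distance(str_a, str_b, snp_a, snp_b, snp_weight=SNP_WEIGHT):
    """Sum of squared STR allele-size differences plus weighted SNP mismatches.

    str_a, str_b: repeat counts at the same ordered Y-STR loci.
    snp_a, snp_b: allele states (0 ancestral, 1 derived) at the same Y-SNPs.
    """
    if len(str_a) != len(str_b) or len(snp_a) != len(snp_b):
        raise ValueError("haplotypes must be typed at the same markers")
    str_part = sum((a - b) ** 2 for a, b in zip(str_a, str_b))
    snp_part = sum(a != b for a, b in zip(snp_a, snp_b))
    return str_part + snp_weight * snp_part


# Hypothetical haplotypes: same STR differences, with and without a SNP mismatch.
same_clade = haplotype_distance((13, 30, 25), (14, 30, 23), (1, 0), (1, 0))
other_clade = haplotype_distance((13, 30, 25), (14, 30, 23), (1, 0), (1, 1))
```

Under this weighting, two chromosomes that differ at a single biallelic marker are always farther apart than any realistic pair within one haplogroup, which is how similar STR profiles arising in different haplogroups are kept from being treated as closely related.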

Results

A total of 39 different haplogroups have been detected in the studied sample set (Figure 1), including an insertion polymorphism at M91 (M91insT with a stretch of 10 thymidines) previously observed in two individuals from a large worldwide sample set.39 No derived alleles at R-M153 (a subclade of R-P312) and R-M222 (a subclade of R-L21) have been detected. Genotyping results for all 1156 individuals are provided in Supplementary Table S2.

AMOVA in the studied populations revealed statistically significant support for two linguistically defined groups of populations in both haplogroup and haplotype distributions (Table 2). It also detected statistically significant genetic differentiation for both haplogroups and haplotypes in the three Polish pre-WWII regional populations (Table 2). The AMOVA revealed small but statistically significant genetic differentiation between the Polish pre-war and modern populations (Table 2). When both groups of populations were tested for genetic structure separately, only the modern Polish regional samples showed genetic homogeneity (Table 2). Regional differentiation of 7-STR haplotypes in the pre-WWII populations was retained even if the most linguistically distinct Kashubian speakers were excluded from the analysis (RST=0.00899, P=0.01505; data not shown). Comparison of Y chromosomes associated with etymologically Slavic and German surnames (with frequencies provided in Table 1) did not reveal genetic differentiation within any of the three Polish regional populations for any of the three genetic distances (FST, ΦST and RST). Moreover, the German surname-related Y chromosomes were as distant from Bavaria and Mecklenburg as those associated with the Slavic surnames (Supplementary Figure S2). MDS of pairwise genetic distances showed a clear-cut differentiation between German and Slavic samples (Figure 2). In addition, the MDS analysis revealed the pre-WWII populations from northern, central and southern Poland to be moderately scattered in the plot, in contrast to the modern Polish regional samples, which formed a very tight, homogeneous cluster (Figure 3).

Table 2 AMOVA results for the studied populations (Hg=39 Y-SNP subclades; Ht17=17 Y-STRs) and for previously published data for Polish pre-war and modern populations (Ht7=7 Y-STRs) (Roewer et al;17 Woźniak et al16, 18)

Figure 2

MDS analysis of (a) FST values for Y-chromosomal haplogroups and (b) ΦST values for 17-locus Y-STR haplotypes observed in the studied populations.

Figure 3

MDS analysis based on ΦST distances for 7-locus Y-STR haplotypes observed in the studied populations compared with data published for 12 Slavic and Germanic populations.16, 17 Filled circles indicate modern populations from northern (Gda Gdansk), central (War Warsaw) and southern Poland (Cra Cracow). Empty circles indicate pre-WWII populations from northern (KaN, KaC, KaS northern, central, southern Kaszuby; Ko Kociewie), central (Ku Kurpie) and southern Poland (PoS). Other Slavic populations: Lu Lusatia; Sl western Slovakia. German populations: Me Mecklenburg; Ba western Bavaria; Gre Greifswald; Ber Berlin; Lei Leipzig; Mai Mainz; Mün Münster. Other Germanic populations: Den Denmark; Got Gotland (Sweden); Ble Blekinge (Sweden).

The MPD and WIMP values did not reveal significant reduction in Y-chromosomal diversity in populations with differential degree of cultural and/or geographic isolation, that is, Kaszuby, Lusatia and Kurpie (Table 1). In order to check for the effect of sampling pre-WWII populations on STR variation, genetic variance (VP) and average squared difference (ASD0) were assessed within the most common haplogroups found in the studied Slavic populations: R-M17*(xM458) and R-M458. Both parameters reached lower values in the native pre-WWII populations of the Vistula and Oder basins in comparison with the modern Polish population studied by Underhill et al.40 A value comparable to the modern Poles was obtained only in the case of ASD0 in the R-M17*(xM458) chromosomes from Kaszuby (Table 3). A median-joining network of our R-M17*(xM458) 17-STR haplotypes revealed a clearly separated cluster of Y chromosomes, involving as many as 22 individuals from Kaszuby, as well as several individuals from other Slavic populations (Supplementary Figure S3). The observed cluster is likely to represent an unknown R-M17 subclade and explains the high ASD0 value in haplogroup R-M17*(xM458) among the Kashubes.

Table 3 VP and ASD0 for 17 Y-STRs in haplogroups R-M17*(xM458) and R-M458 in native pre-war regional populations of the Vistula and Oder basins (this study) and in the modern Polish population, studied by Underhill et al40

BATWING of the Slavic populations of Kaszuby and Lusatia provided convergent MCMC chains with unimodal distribution and revealed that their divergence took place 1.7 kya (95% confidence intervals: 1.4–2.1 kya) and was preceded by 0.6 ky of demographic expansion with a 4.2% growth rate (Table 4).

Table 4 Times of demographic expansion and split for Y chromosomes from the populations of Kaszuby and Lusatia

As both the Sorbs and Kashubes are historically the most closely related to the extinct Slavic tribes of eastern Germany and neither directly contributed to the modern German population of Mecklenburg, it was assumed that the population of Mecklenburg resulted from admixture of western German (Bavarian as a proxy), Sorbian and Kashubian populations. All the ancestry estimates were the highest for the western German population (Supplementary Table S3). On the other hand, admixture analysis failed to detect considerable German ancestry in paternal lineages of genetic outliers detected in the MDS analysis, that is, the Sorbs and Kashubes (Supplementary Table S4). After inclusion of data from German regional populations studied by Kayser et al,3 the Slavic (Sorbian or Kashubian) ancestry estimates mR, mW and mY for the pooled eastern German populations (n=678) in comparison with the pooled western German populations (n=886) ranged from 0.182 to 0.261.

Discussion

Most molecular anthropological studies concerning early human history in Central Europe29, 40, 41 exploit the previously observed geographic homogeneity of Polish paternal lineages.2 Although it was suggested that the homogeneous Polish Y-chromosomal gene pool was formed very recently after the massive human resettlements linked to WWII,2 a previous study on a southern Polish population failed to detect genetic differences between pre-WWII and post-WWII Y chromosomes in the region.16 However, it should be noted that the studied region did not experience massive population exchange and its post-WWII settlers originated mainly in the neighbouring areas.16 The same authors studied a modern population of Kaszuby, the most linguistically distinct ethnic group among modern Poles, and no genetic differentiation within the Polish population was found.18 Our results are based on pre-WWII regional populations from four out of five main Polish linguistic/dialectal groups (Kashubian, Masovian, Greater Polish and Lesser Polish), and demonstrate for the first time that the Polish paternal lineages were unevenly distributed within the country before the forced resettlements of millions of people during and shortly after WWII. The small but statistically significant differentiation between the pre-WWII and modern populations is particularly remarkable given that modern Polish regional samples comprise varying ratios of pre-WWII inhabitants and post-WWII settlers. The observed heterogeneity suggests that precautions should be taken in order to collect representative population samples from Poland for evolutionary studies, as well as for forensic purposes in the statistical evaluation of genetic evidence concerning regions densely populated by native pre-WWII inhabitants.

Alternatively, the observed substructure could result from the fact that our pre-WWII samples originated in rural areas that were less likely to be influenced by migrations than large cities,32 whereas Ploski et al2 revealed geographic homogeneity of Y-chromosomal lineages in general populations of several Polish regions. However, it should be noted that WWII-mediated resettlements involved both urban and rural populations. The study by Woźniak et al18 on the modern population of Kaszuby from villages and small towns did not detect its distinctiveness from other modern Polish regional samples. This may be owing to the fact that in 1950 post-WWII settlers constituted as many as 36.7% of the inhabitants of an area roughly corresponding to the regions of Kaszuby and Kociewie42 (in the populations studied by Ploski et al,2 the share of post-WWII settlers in 1950 ranged from 6.8% in the Cracow region to 93.8% in the Wroclaw region42). It also argues against the rural origin of our pre-WWII Polish regional populations as the main reason for the detected substructure.

Parameters measuring STR variation within Y-chromosomal haplogroups are commonly used for dating of SNP mutations in order to draw conclusions about origins and history of human populations.23, 24 Underhill et al40 observed the highest genetic diversity in Europe for R-M17*(xM458) and R-M458 subclades in the Vistula and Oder basins, which correspond roughly to the present-day territory of Poland. We examined Y-STR variation within the two subclades in pre-WWII Polish regional populations of the Vistula basin (Kurpie, Kociewie and Kaszuby) and in a native population of the Oder–Elbe basin borderland (Lusatia), and revealed a similarly high ASD0 value as in the modern Polish population only for R-M17*(xM458) in Kaszuby, which we explained by the presence of an unknown subclade detected in the median-joining network. Apart from R-M17*(xM458) in Kaszuby, genetic diversity for both R-M17 subclades was lower (in several cases much lower) in the native pre-WWII populations than in the modern one. This may be owing to the extensive mixing of the Polish population after the post-WWII massive resettlements, with millions of modern Poles tracing their pre-WWII origin to the Dniester, Dnieper and Neman basins in present-day Ukraine, Belarus and Lithuania.

Kayser et al3 revealed significant genetic differentiation between paternal lineages of neighbouring Poland and Germany, which follows the present-day political border and was attributed to massive population movements during and shortly after WWII. Although the very recent origin of the geographic course of the detected genetic boundary is not in doubt, it remained unknown whether Y-chromosomal diversity in ethnically/linguistically defined Slavic and German populations, which used to be exposed to intensive interethnic contacts and cohabited ethnically mixed territories, was clinal or discontinuous already before the war. In contrast to the regions of Kaszuby and Kociewie, which were politically subordinated to German states for more than three centuries and, before the massive human resettlements in the mid-20th century, occupied a narrow strip of land between German-speaking territories, the Kurpie region practically never experienced longer periods of German political influence or direct neighbourhood with German populations. Lusatia was conquered by Germans in the 10th century and since then was a part of German states for most of its history; the modern Lusatians (Sorbs) inhabit a Slavic-speaking island in southeastern Germany. In spite of the fact that these four regions differed significantly in exposure to gene flow with the German population, our results revealed their similar genetic differentiation from Bavaria and Mecklenburg. Moreover, admixture estimates showed hardly detectable German paternal ancestry in Slavs neighbouring German populations for centuries, that is, the Sorbs and Kashubes. However, it should be noted that our regional population samples comprised only individuals of Polish and Sorbian ethnicity and did not involve the pre-WWII German minority of Kaszuby and Kociewie, which ceased to exist owing to forced resettlements in the mid-20th century, and also did not involve Germans, who since the 19th century have constituted the majority ethnic group of Lusatia. Thus, our results concern ethnically/linguistically rather than geographically defined populations and clearly contrast with the broad-scale pattern of Y-chromosomal diversity in Europe, which was shown to be strongly driven by geographic proximity rather than by language.1 They are also consistent with a previous study on autosomal markers, which provided evidence for clear genetic departure of the Sorbs from the neighbouring Germans and their genetic similarity to the Slavic-speaking Poles and Czechs.43 Although data for German-speaking populations that used to live in the neighbourhood of the Slavs of Kaszuby and Kociewie are not available, data from the Sorbs and neighbouring Germans could be used as a proxy, and our AMOVA results and ancestry estimates suggest that a genetic barrier between Slavic and German speakers similar to the one detected by Kayser et al3 between modern Poland and Germany might have existed already before the war.

Immel et al44 revealed German and Slavic surname-associated strata in the Halle region in southeastern Germany, which was explained by 19th-century migration from the Polish-speaking territories. As German surnames are frequently encountered among modern Poles, we searched for such differentiation within the Polish pre-WWII regional populations. Both Slavic and German surname carriers revealed regional Y chromosome homogeneity and comparable genetic distances from the German populations, which suggests that etymologically German surnames in the studied populations may result, at least partially, from foreign administration and linguistic adaptation (eg, translation, common until the end of the 19th century and attested also in the 20th century), well documented in historical sources,26, 27 rather than from genetic admixture.

Two main factors are believed to be responsible for the Slavic language extinction in vast territories to the east of the Elbe and Saale rivers: colonisation of the region by German-speaking settlers, known in historical sources as Ostsiedlung, and assimilation of the local Slavic populations. The contribution of each factor to the formation of the modern eastern German population has, however, remained highly speculative.8 Previous studies on Y-chromosomal diversity in Germany by Roewer et al17 and Kayser et al3 revealed east–west regional differentiation within the country with eastern German populations clustering between western German and Slavic populations but clearly separated from the latter, which suggested only a minor Slavic paternal contribution to the modern eastern Germans. Our ancestry estimates for the Mecklenburg region (Supplementary Table S3) and for the pooled eastern German populations, assessed as being well below 50%, definitely confirm the German colonisation with replacement of autochthonous populations as the main reason for extinction of local Slavic vernaculars. The presented results suggest that early medieval Slavic westward migrations and late medieval and subsequent German eastward migrations, which outnumbered and largely replaced previous populations, as well as very limited male genetic admixture to the neighbouring Slavs (Supplementary Table S4), were likely responsible for the pre-WWII genetic differentiation between Slavic- and German-speaking populations. Woźniak et al18 compared several Slavic populations and did not detect such a sharp genetic boundary in the case of Czech and Slovak males, who occupied a genetically intermediate position between other Slavic and German populations, which was explained by early medieval interactions between Slavic and Germanic tribes on the southern side of the Carpathians. Nevertheless, paternal lineages from our Slovak population sample were genetically much closer to their Slavic than to their German counterparts.

Coalescence-based analysis of populations sharing common ancestry, which experienced subsequent cross-migration, leads to underestimation of their divergence time. On the other hand, coalescence-based analysis of populations sharing common ancestry, which experienced subsequent gene flow with unrelated populations, is likely to overestimate their divergence time and affect other demographic parameters. As the model implemented in BATWING does not assume migration between diverged populations, our analysis was performed on the populations of Kaszuby and Lusatia, which owing to geographic remoteness and a linguistic barrier remained isolated from each other and from their German-speaking neighbours. Our coalescence-based divergence time estimates for the two isolated western Slavic populations almost perfectly match historical and archaeological data on the Slavs’ expansion in Europe in the 5th–6th centuries.4 Several hundred years of demographic expansion before the divergence, as detected by BATWING, support the hypothesis that the early medieval Slavic expansion in Europe was a demographic event rather than solely a linguistic spread of the Slavic language.