Mitochondrial DNA variability of the Polish population

The aim of the present study was to define the mtDNA variability of the Polish population and to visualize the genetic relations between Poles. For the first time, a study of the Polish population was conducted on such a large number of individuals (5852), representing administrative units of both levels of local administration in Poland (voivodeships and counties). Additionally, clustering was used as a method of population subdivision. The genetic analysis included FST, an MDS plot, AMOVA and SAMOVA. Haplogroups were classified and their geographical distribution was visualized using surface interpolation maps. The results showed that Poles are characterized by the main West Eurasian mtDNA haplogroups. Furthermore, the level of differentiation within the Polish population was quite low, but the existing genetic differences could be explained well by geographic distances. This may lead to the conclusion that Poles can be considered genetically homogeneous but with slight differences, highlighted at the regional level. Some patterns of variability were observed and could be explained by the history of demographic processes in Poland, such as resettlements and migrations of women, or the relatively weaker urbanisation and higher rural population retention of some regions.

Introduction
mtDNA analysis plays an important role in identifying the origin of individuals in a population. It is used especially in population genetics and molecular evolution studies and helps to explain human migration and settlement across the regions of a country or the whole world [1,2]. Maternally inherited mitochondrial DNA haplogroups indicate maternal-line ancestry and have been identified in geographically isolated populations throughout the globe [3], reflecting human migration and ancestry [4]. Haplogroups from Africa (L0, L1, L2, L3) are the oldest and are those from which the European, Asian and Native American haplogroups evolved through geographic migrations and climate adaptations [5]. Nine haplogroups are major in the European population: H, U, J, T, K, W, I, V and X [6,7]. The most frequent European haplotypes were classified into the HV, U and JT macro-haplogroups, which form 90% of the population [3]. Several sources indicate haplogroup H as the most frequent in Europe [8].
mtDNA variability in the Polish population has been studied in comparison to Russians [1,2] or as an element of the broader group of Slavs [9–11]. Studies paying specific attention to administrative division and/or geographic context are still limited. There is no detailed information about the mtDNA haplotypes characteristic of the inhabitants of a particular voivodeship (województwo) or county (powiat) in Poland. Clustering, as an additional method of grouping individuals, has also never been applied to the Polish population.
National population biobanks and sample repositories store human biological material, mostly for use in genetic research that connects lifestyle and medical history with genetic traits. Genetic and molecular information associated with data about the sample donor can also be used in population studies [12]. Furthermore, high-density SNP microarrays, a successful tool for analysing large amounts of genetic data, have been used in many population studies to analyse the structure and ancestry of global [13], European [14–16] and individual country populations [17,18].
The aim of the present study was to determine the mtDNA variability of the Polish population, including its geographical and historical context. For this purpose, the haplotypes obtained for 5852 individuals were classified into major haplogroups and subhaplogroups, and their distribution across units of the first (voivodeship) and second (county) level of local government and administration in Poland was analysed. For the first time, a study of the Polish population was conducted on such a large number of individuals. The gathered data set was then clustered on the basis of genetic information as well as information about the place of origin, allowing us to compare the rather artificial division into voivodeships (n = 16) and counties (n = 349) with the more natural division into clusters (n = 80), which may largely correspond to geographic regions.

Population
The studied population consisted of individuals recruited between 2010 and 2012 within the TESTOPLEK research project. All samples belonged to the POPULOUS collection, which has been registered in the BBMRI catalog since 2013 [19,20]. The experimental group included samples taken from 5852 individuals representing administrative units of both levels of local administration in Poland: all 16 voivodeships (Fig. 1 and Fig. S1) and the majority of counties (349 out of a total of 380; this number includes counties and city counties). Written information about the place of birth and current residence was obtained from each subject. Approval for this study was obtained from the University of Łódź Ethics Review Board. All procedures were performed in accordance with the Declaration of Helsinki (ethical principles for medical research involving human subjects). The full set of results can be obtained from the European Genome-phenome Archive [21] (www.ebi.ac.uk/ega; study accession number, EGAS00001003309).

Clustering
The K-means clustering method, applied to spatial coordinates, was used to merge individual counties into larger geographic groups (clusters) on the basis of the nearest mean. Each cluster (Fig. 1) is represented by its geographic centre, and the algorithm converges to stable cluster centroids [22]. Clustering was performed with the Scikit-learn package [23] in Python ver. 3.6.3 [24].
The list of clusters, giving the cluster number to which each county was assigned as well as the name of the corresponding geographic region, is provided in Supplementary Table S1.
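The assignment and update steps behind K-means can be illustrated with a minimal sketch. The study itself used the Scikit-learn implementation; the code below instead re-implements the basic Lloyd iteration with the Python standard library only, so that both steps are visible. The function name `kmeans`, the example coordinates, the number of clusters and the deterministic initialisation (the first k points) are illustrative assumptions, not the settings applied to the 349 counties, and treating longitude and latitude as planar coordinates is a simplification.

```python
import math

def kmeans(points, k, max_iter=100):
    """Plain Lloyd's algorithm on 2-D points; returns (labels, centroids)."""
    # Assumption: initialise centroids with the first k points (deterministic,
    # unlike the randomised initialisations typically used in practice).
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k),
                key=lambda c: math.hypot(p[0] - centroids[c][0],
                                         p[1] - centroids[c][1]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable, so converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(x) / len(members)
                                     for x in zip(*members))
    return labels, centroids

# Hypothetical county centres as (longitude, latitude) pairs.
counties = [(19.0, 50.2), (19.1, 50.3), (18.9, 50.25),
            (21.0, 52.2), (21.1, 52.3), (20.9, 52.25)]
labels, centroids = kmeans(counties, k=2)
```

Here the three south-western points end up in one cluster and the three north-eastern points in the other; each returned centroid plays the role of the geographic centre of its cluster.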

Microarray analysis
Infinium HTS Human Core Exome PLUS microarrays were used to genotype DNA samples for 551,945 SNPs according to the manufacturer's protocol (Illumina Inc., San Diego, CA, USA). Qualitative analysis was performed to identify outliers and artefacts on the microarray. Samples were excluded if the call rate was below 0.94 and if the 10% GenCall parameter was below 0.4. Visual inspection was conducted to investigate heteroplasmy, which was detected in only a few cases.

mtDNA typing
The microarrays used allowed the identification of 323 SNPs (single nucleotide polymorphisms) in mtDNA (Table S2), described according to the recommendations for the description of sequence variants [25]. Quality control procedures were conducted using PLINK software [26]. An in-house script was used to convert the raw data from PLINK format into a format readable by Haplogrep software.

Statistical analysis
Haplogrep software was then used to classify haplotypes into haplogroups and subhaplogroups (Phylotree build 17, http://phylotree.org/tree/index.htm) [27]. Haplogroup frequencies were calculated for every voivodeship and county by counting. The analysis of molecular variance (AMOVA), together with FST values [28], was performed for both voivodeships and clusters using Arlequin v3.5 software [29]. To visualize the relationships between voivodeships and between clusters, multidimensional scaling (MDS) was used to plot the pairwise FST genetic distances with the cmdscale function in R ver. 3.4.2 [30]. Furthermore, to determine the spatial pattern of genetic divergence (i.e., to identify the most probable geographic model of population grouping), SAMOVA (spatial analysis of molecular variance) [31] was done in SAMOVA ver.
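Calculating haplogroup frequencies "by counting" can be sketched as follows. This is only an illustration of the idea, assuming that each sample has already been reduced to a (region, haplogroup) pair; the function name `haplogroup_frequencies` and the sample values are hypothetical and do not come from the study data.

```python
from collections import Counter, defaultdict

def haplogroup_frequencies(samples):
    """Return {region: {haplogroup: relative frequency}} by simple counting."""
    counts = defaultdict(Counter)
    for region, haplogroup in samples:
        counts[region][haplogroup] += 1
    return {
        region: {hg: n / sum(c.values()) for hg, n in c.items()}
        for region, c in counts.items()
    }

# Hypothetical (voivodeship, haplogroup) assignments, e.g. parsed from a
# haplogroup classification result; the values are made up for illustration.
samples = [("Silesian", "H"), ("Silesian", "H"), ("Silesian", "U"),
           ("Silesian", "J"), ("Lesser Poland", "H"), ("Lesser Poland", "T")]
freqs = haplogroup_frequencies(samples)
```

The same function works at the county or cluster level if the first element of each pair is replaced by the corresponding unit.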

Voivodeship comparison
Four voivodeships (Greater Poland, Silesian, Łódź and Lower Silesian) show a structure similar to the Polish average in terms of the relative frequency of the six major haplogroups (H, U, J, T, HV, K). The highest number of haplogroups (n = 19) was observed in Silesian voivodeship, while the lowest (n = 10) was observed in Holy Cross voivodeship. However, these two voivodeships also had the largest and the smallest sample sizes (963 and 72, respectively). Among the voivodeships with a sample size between 200 and 500, 17 haplogroups were observed in Lesser Poland but only 12 in Łódź voivodeship. Pearson's Chi-square test was performed to assess the differences between voivodeships in the frequencies of 10 main haplogroups (Tables S4–S13). The results of Pearson's Chi-square test prompted us to look more closely at the differences between regions. Therefore, interpolation analysis was performed for the frequencies of eight main haplogroups to show their distribution across Poland. Mapping the haplogroup frequencies onto Poland using the interpolation method allowed us to highlight the differences between regions. A different pattern of distribution of the eight main haplogroups was observed for every voivodeship. However, the observed differences were relatively small (Figs 3 and 4).
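For readers unfamiliar with the test, the following minimal sketch shows how a Pearson Chi-square statistic is obtained from a table of counts. It is not the authors' analysis: the function name `pearson_chi_square` and the counts are hypothetical, and only the statistic and the degrees of freedom are computed (a p-value would additionally require the Chi-square distribution, e.g. from a statistics library).

```python
def pearson_chi_square(table):
    """Chi-square statistic and degrees of freedom for an r x c count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of row and column.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof

# Hypothetical counts: rows = two voivodeships,
# columns = carriers / non-carriers of one haplogroup.
table = [[30, 70],
         [20, 80]]
chi2, dof = pearson_chi_square(table)
```

With these made-up counts the expected values are 25 and 75 in each row, so the statistic is 8/3 (about 2.67) with one degree of freedom.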

Genetic variability
To define the differentiation within the Polish population in terms of the similarities and differences between voivodeships, pairwise FST analyses were performed. All FST estimates were positive but low and ranged from 0.00011 to 0.02045 (Table S14). The highest, statistically significant differences were observed between FST values calculated for Holy  Table S14).
Detailed information about the FST values calculated for all voivodeships and clusters is given in Tables S14 and S15 (in Supplementary materials) and presented in Figures S2 and S3, respectively.

AMOVA
Analysis of molecular variance based on the mtDNA sequences revealed that, when voivodeships were taken into account, most of the variation occurred within populations (99.78%; p = 0.01075). Only a small proportion of the total variance was attributed to variation among voivodeships (0.21%; p = 0.01075) (Table 2). Analysis of molecular variance computed for cluster populations also revealed that most of the variation occurred within populations (99.09%; p = 0.00109), with only a small proportion of the total variance attributed to variation among clusters (0.91%; p = 0.00109) (Table 2).
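How the within-population and among-population percentages arise can be sketched with a simplified, one-level AMOVA. This is an illustration under stated assumptions, not the Arlequin procedure used in the study: haplotypes are represented as equal-length strings, the number of differing sites is taken as the squared distance between two haplotypes, only one hierarchical level (individuals within populations) is considered, and no permutation test is performed. The function names and the toy haplotypes are hypothetical.

```python
from itertools import combinations

def hamming(a, b):
    """Number of differing sites; used here as the squared distance."""
    return sum(x != y for x, y in zip(a, b))

def ssd(haplotypes):
    """Sum of squared deviations: summed pairwise distances divided by n."""
    n = len(haplotypes)
    return sum(hamming(a, b) for a, b in combinations(haplotypes, 2)) / n

def amova_one_level(populations):
    """Return (% variance among populations, % within, Phi_ST)."""
    pooled = [h for pop in populations for h in pop]
    n_total, n_pops = len(pooled), len(populations)
    ssd_within = sum(ssd(pop) for pop in populations)
    ssd_among = ssd(pooled) - ssd_within
    ms_among = ssd_among / (n_pops - 1)
    ms_within = ssd_within / (n_total - n_pops)
    # Average sample size corrected for unequal population sizes.
    n0 = (n_total - sum(len(p) ** 2 for p in populations) / n_total) / (n_pops - 1)
    var_among = (ms_among - ms_within) / n0
    var_within = ms_within
    total = var_among + var_within
    return 100 * var_among / total, 100 * var_within / total, var_among / total

# Toy data: two hypothetical populations of three haplotypes each.
pops = [["AAAA", "AAAA", "AAAT"],
        ["AAAA", "TAAA", "TAAA"]]
pct_among, pct_within, phi_st = amova_one_level(pops)
```

In this toy example one third of the variance lies among populations; in the study the corresponding share was below 1%, which is what a nearly homogeneous population looks like under this decomposition.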

SAMOVA
Analysis of molecular variance conducted in SAMOVA ver. 2.0 software, based on the mtDNA SNPs and aimed at determining the most probable number of genetically different population groups, showed that the maximal number of significantly divergent groups is 33. The highest variance among groups was observed when the population was divided into 2 groups (cluster no. 64 separated from the rest of the clusters; variance among groups = 2.41%; p < 0.00001). With an increasing number of groups, a downward trend with fluctuations was observed, so local maxima could be identified (at 2, 4, 7, 11, 15, 18, 21, 23, 25, 29 and 31 groups). When dividing into the numbers of groups corresponding to local maxima, the following clusters were separated: no. 64 (7 times), no. 34 (6 times) and no. 71 (5 times), while cluster 37 was grouped with cluster 59 (5 times). When dividing into the maximum number of 33 groups, the variance among groups was equal to 0.32% (p < 0.00001).

Discussion
mtDNA variability in the Polish population was previously studied in comparison to Russians [1,2] or as an element of the broader group of Slavs [9–11]. In the current study, an attempt was made to describe comprehensively the mtDNA variability and genetic connections of the Polish population, based on a large group of individuals (5852) and including clustering of administrative units as an additional method of population subdivision for increased geographical relevance. In the analysis of haplogroup frequencies, H was found to be the most frequent haplogroup in the Polish population. This is consistent with the findings of Grzybowski et al. [9] and Mielnik-Sikorska et al. [11]. An interesting analysis of haplogroup and subhaplogroup distribution was done by Malyarchuk et al. [1], but we can compare our results only to the main findings for the entire Polish population, without division into regions. The cited study of the Polish population showed a 45.2% frequency of this haplogroup [1], which is almost identical to our findings.
Similarly, in the case of the U, J, T, K and W haplogroups, the frequencies obtained in the current study were practically the same as those of Malyarchuk et al. [1], whose study was based on the analysis of 436 individuals from the Kuyavian-Pomeranian region. The only difference was observed in the case of the HV haplogroup. Malyarchuk et al. [1] identified a 1% frequency of this haplogroup, while in our study it was 4.46%. The number of individuals may explain this, as the size of the studied sample is of great importance in the case of rare haplogroups. Our findings are also consistent with other studies of the European population [8,33] as well as of individual countries and regions such as Spain [34,35], Portugal [36] with the Azores [37], the islands of the North Atlantic [38], Sardinia [39] and Russia [1,2], where haplogroup H was also indicated as the most frequent.
Most of the voivodeships in Poland reveal divergent patterns of major haplogroup frequencies, which differ from the values for Poland in general. In the literature, descriptions of regional populations of Poland based on mtDNA haplogroup distribution can be found only for selected regions, such as Gdańsk, Kashubia, Suwałki, Upper Silesia [9] and Podhale [11]. In the case of haplogroup H, our results (compared at the level of the appropriate administrative units, i.e., voivodeships or counties) were consistent with the literature for all studied regions except Podhale, where the reported frequency was around 30% [11] while in our study (Tatra county) it was 19.5%. In the Gdańsk region, the frequencies of the 6 most common haplogroups obtained by Grzybowski et al. [9] were almost the same as in the current study.
With regard to the studies on Ashkenazi maternal lineages [40] and mitochondrial markers of Jewish ancestry [41], and the motifs proposed to define four major Ashkenazi founder clusters (K1a1b1a, K1a9, K2a2a and NH1b1), we could not assess their occurrence within the Polish population because the relevant polymorphic sites (16093-16176-16223-16224-16234-16311-16519) are absent from the microarray used. Only one site from the proposed motif was present (16145). Grzybowski et al. [9] found the K1a1b1a lineage in individuals from the Gdańsk region and Upper Silesia, based on the specific mtDNA motif.
Interestingly, the frequency of the L haplogroup, one of the rarest in Europe, observed in the current study and in the study of Mielnik-Sikorska et al. [11] was similar for the Podhale region (1% vs. 3%). L1b is the most common African clade in Europe [42]; in the study of Mielnik-Sikorska et al. [11], the L1b1a8a and L2a subclusters were identified among Polish individuals, with the presence of the L2a1 haplotype ascribed to Ashkenazi Jewish influences. In this study, both haplotypes (L1b1 and L2a1) were found in individuals from different regions of Poland: 4 individuals with L1b1 from Upper Silesia and 3 individuals with the L2a1 haplotype from Gorlice and Częstochowa counties. We additionally identified the L2e and L3e subclades: 1 individual with L2e from Nowy Tomyśl county and 2 individuals with L3e from Poznań. Interestingly, L0 is the most common haplotype in East Africa, the Near East and the Arabian Peninsula [43]. In the current study, L0a1a was found in 2 individuals from Tatra county. In the current study, we focused on genetic relationships and regional connections, omitting a detailed subhaplogroup analysis. However, the frequencies were calculated, and H1 (15.42%), U5 (12.35%) and J1 (8.34%) were observed as the most frequent subhaplogroups in the Polish population; this is also in agreement with the study of Grzybowski et al. [9] (Table S2).
As mentioned above, the Polish population has been the subject of genetic research, but only in comparison to broader groups of Slavs or Europeans. Grzybowski et al. [9] made a genetic analysis, based on haplogroup frequencies, of four populations from Poland (the Suwałki and Gdańsk regions, Kashubia and Upper Silesia) in comparison to selected populations of Russia. Suwałki was indicated as the most divergent region, separated from the remaining Polish populations and grouped together with northwestern Russians. In  [44]. All of these have caused the destruction of social relations but, on the other hand, allowed a well-mixed and homogeneous population to form. The homogeneity of the Polish population was mentioned before in the studies of Płoski et al. [45], Kayser et al. [46], Woźniak et al. [47] and Rębała et al. [48]; however, those results were based only on the analysis of the Y chromosome. The results of this study complement the description of the Polish population and confirm that our population is homogeneous as far as mtDNA variability is concerned. Our study showed that most of the molecular variation based on the mtDNA sequences occurs within the population at large and that very low variation was detected among subpopulations, both when voivodeships and when clusters were analysed. Despite this homogeneity, some patterns of variability (separate voivodeships and clusters) are observed and can be explained by the history of demographic processes in Poland. West Pomeranian and Warmian-Mazurian voivodeships were observed as outliers in the Polish population, which could confirm their genetic separateness caused by resettlements and migrations of women. These voivodeships were settled after the Second World War by people who had inhabited Kresy (the Eastern Borderlands of Poland). West Pomeranian was settled mostly by people from the Baranowicze, Pińsk and Kowel regions (now Belarus and Ukraine), while Warmian-Mazurian was settled by Poles from the Vilnius region (now Lithuania) [44].
However, in the current study, Holy Cross and Łódź voivodeships were found to be the most separate, which is not reflected in the history of the migration of Poles. These voivodeships are quite native in their population composition and have not been the areas of massive migrations. In this case, the reason for separation must be different, but is probably also connected with demographic processes occurring in this part of Poland, such as relatively weaker urbanisation and higher rural population retention. Furthermore, detailed analysis of clusters showed that only a few of them located within those voivodeships were statistically separate. Thus, it cannot be proven that migration was the reason for genetic separation.
For the first time, clustering, a method of population subdivision, was used to define the genetic relationships within the Polish population. Additionally, the administrative division of Poland was overlaid on genetic separation in order to present the most complete view of Polish society. Furthermore, for the first time, the study of the Polish population was conducted on such a large number of individuals (5852). All of this makes it difficult to find similar studies to directly relate our findings to results of others. There were important differences in analysis based purely on administrative division and on geographical clustering, which is expected, showing that a large dataset makes it possible to perform a deeper and more relevant analysis.
Our comprehensive analysis, based on data from 5852 individuals, allowed us to describe the mtDNA variability of the Polish population and the genetic relations between Poles. It gives a better insight into mtDNA variability in Poland, with detailed administrative divisions and geographical regionalization. A complete genetic analysis including all voivodeships and most counties of Poland has been performed for the first time. Poles are characterized by the main West Eurasian mtDNA haplogroups, but the relatively minor genetic differences observed at the level of voivodeships and clusters may indicate historical and cultural influences. Although the level of differentiation within the Polish population was found to be low, the existing genetic differences can be explained well by geographic distances. Using a large data set, it was shown that Poles can be considered genetically homogeneous but with slight differences, highlighted at the regional level. The structure of our study allowed us to confirm that intrastate administrative divisions are artificial formations and do not reflect the genetic diversity of specific populations. Clusters based on spatial information are more adequate, and in similar studies researchers should consider grouping the available samples by geographic location, which enhances the quality of the analysis in comparison to division into voivodeships and counties. The present study was based only on mitochondrial markers, which can illustrate gene flow in the maternal line. Therefore, conclusions can be drawn exclusively about migrations and settlements of women. The present survey could be the basis for further research on the historical context of human migration or resettlements when expanded with an analysis of the Y chromosome.

Compliance with ethical standards
Conflict of interest The authors declare that they have no conflict of interest.
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.