Introduction

The Italian population has a greater degree of internal genomic variability than the populations of other European countries.1 This reflects both geographic isolation within Italy, owing to its mountainous topography, and historical events that triggered demographic changes. An example of the latter is the long history of the Roman Empire (27 BCE–476 CE), whose decline was accompanied by a considerable reduction in population size and a series of subsequent migratory waves across Italy. Owing to its crucial position at the centre of the Mediterranean basin, the Italian peninsula has experienced a complex history of colonization and migration, and the genetic signatures of these human origins are still present in contemporary Italians.2, 3, 4 Deeper insight into the history of the Italian population is also critical for understanding the peopling of Europe.

The genetic structure of Italy, whose unity of people and culture is quite recent, was initially analysed using classical genetic markers by Piazza et al.5 and Cavalli-Sforza et al.6 Recently, using genome-wide data, O’Dushlaine et al.7 defined fine-scale genetic differences among people from different rural villages in Northern and Central Italy. Geographical patterns of Y chromosomal and mtDNA diversity in Italy, mainly determined by the combined action of drift and founder effects, have been described.8, 9 A correlation between genetic and geographic structure in Europe has also been found,10, 11 with a detectable distinction between Southern Italians and other Europeans.

In addition to evolutionary history, the identification of genetic substructure in apparently homogeneous populations can improve association mapping in admixed populations.12 Moreover, it has recently been shown that variation in susceptibility to certain diseases, and response to drugs or therapies, is related to different proportions of non-European ancestry in admixed populations.13

The aim of this study was to investigate fine-scale Italian population genetic substructure. We used both the large single-nucleotide polymorphism (SNP) data set, which we collected from a well-characterized Italian sample, and the most recent haplotype-based population genetic algorithms, such as fineSTRUCTURE,14 which are able to provide finer resolution of the genetic structure of populations, as was shown for the UK population by Leslie et al.15 Specifically, our aims were to test the feasibility of identifying differences at the microregional level within Italy, to compare and quantify the contributions of populations from Europe and the Mediterranean basin to the genetic composition of the Italians and to explore the historical events that led to the observed high genomic variability within Italy. Several analytical approaches were combined to obtain a complete portrait of ‘the Italian genome’, and to test the robustness of our results across different methodologies. We also traced the major historical events that led to the complex genomic mosaic observed within Italy, including demographic changes and waves of migration across Europe and the Mediterranean basin.

Materials and methods

Subjects

Italian subjects were selected from 11 of the 20 Italian administrative regions according to a list of surnames typical of specific regions, geographical areas or provinces, based on previous work by Zei et al.16, 17 Each subject included in the study had a well-defined geographical origin: four grandparents (and parents) born in the same administrative region, assessed through interviews at the time of blood collection. The distribution of the study samples across Italy and the sampling provinces is shown in Figure 1 and Supplementary Table S1 with sample sizes. Informed consent was provided by all the participants at the time of enrolment. An internal ethical review board at the Human Genetics Foundation (HuGeF, Comitato Etico HUGEF/15-12-2011) approved this study.

Figure 1

Sampling location barycentre (average sampling place weighted by the number of individuals from each sampling point) and sample size for each of the 11 Italian regions analysed in this study (latitude and longitude in parentheses): Aosta Valley (45°70′N–7°30′E); Basilicata (40°69′N–16°57′E); Calabria (38°10′N–15°70′E); Emilia Romagna (44°80′N–11°60′E); Latium (42°40′N–12°10′E); Liguria (44°30′N–8°50′E); Lombardy (45°39′N–9°85′E); Piedmont (45°19′N–7°90′E); Sardinia (40°22′N–9°30′E); Sicily (37°41′N–13°72′E); Tuscany (43°31′N–11°35′E).
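As an illustration of the weighted barycentre described in the caption, the following minimal Python sketch averages sampling coordinates weighted by the number of individuals per sampling point. The function name `barycentre`, the tuple layout and the example coordinates are hypothetical; a plain arithmetic mean of decimal-degree coordinates is assumed, which is only a reasonable approximation over small areas.

```python
def barycentre(points):
    """Weighted mean position of sampling points.

    points: list of (latitude, longitude, n_individuals) tuples,
            with coordinates in decimal degrees (assumed layout).
    """
    total = sum(n for _, _, n in points)
    if total == 0:
        raise ValueError("at least one individual is required")
    lat = sum(la * n for la, _, n in points) / total
    lon = sum(lo * n for _, lo, n in points) / total
    return lat, lon


# Hypothetical example: two sampling points with 10 and 30 individuals.
example = [(45.0, 7.0, 10), (46.0, 8.0, 30)]
print(barycentre(example))
```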

Data sets

Two data sets were used to address the different topics of this study.

The first comprised 1 698 926 SNPs investigated in 300 Italian samples, and was used to compare genetic profiles among subjects sampled from the different Italian regions and macroareas. Genome-wide genotype data and the macroarea of origin for each sample will be made available via the European Genome-phenome Archive (EGA) repository (https://www.ebi.ac.uk/ega) under the accession number EGAS00001001458, and they are already available upon request from the HuGeF repository (see Additional information section for the reference web link). See Supplementary Methods for details about data collection, DNA extraction, genotyping, quality controls and SNP imputation procedures. The second data set comprised 347 131 SNPs assayed on 1272 samples (Table 1), and was used for the comparison between the Italian and non-Italian populations.

Table 1 List of samples included in the study

Italian population substructure

The Italian population substructure was investigated with the model-based Bayesian cluster algorithm implemented in the combined software ChromoPainter-fineSTRUCTURE,14 using a model that takes into account linkage disequilibrium (LD) between SNPs (‘linked’ model) with the default parameters. The results were reported as a heatmap (coancestry matrix) of pairwise similarity between subjects, expressed as the number of genomic segments inherited from the same source population. Principal component analysis (PCA) was carried out using the same coancestry matrix, and a dendrogram based on hierarchical clustering was constructed. The clustering algorithm uses a Bayesian approach, in which the number of donor populations K and the expected proportion of chunks from each donor population to each individual are inferred by maximum likelihood. Assignment to each cluster is then performed using a Markov Chain Monte Carlo method. See Lawson et al.14 for details.

Identical-by-descent calling

Pairwise shared identical-by-descent (IBD) segments were identified by the fastIBD method implemented in BEAGLE v.3.2 (ref. 18) using default parameters. The postprocessing procedure suggested by Ralph and Coop19 was used to minimize the number of false-positive calls. Briefly, the IBD call algorithm was run 10 times with 10 different random seeds; any segment not overlapping a segment seen in at least one other run was removed; any two segments separated by a gap shorter than at least one of the segments and no more than 5 cM long were merged; any merged segment that did not contain a subsegment with score below 1 × 10⁻⁹ was removed.
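The merging rule in the postprocessing step can be sketched in Python as follows. This is only an illustration of the stated rule (merge two neighbouring segments when the gap is no more than 5 cM and shorter than at least one of the two segments), not the code used in the study. The function name `merge_segments` and the representation of segments as (start, end) positions in cM on one chromosome for one pair of individuals are assumptions, as is the simplification of comparing each new segment with the already merged previous one.

```python
def merge_segments(segments, max_gap_cm=5.0):
    """Merge neighbouring IBD segments for one pair on one chromosome.

    segments: list of (start_cM, end_cM) tuples (assumed representation).
    Two neighbours are merged when the gap between them is at most
    max_gap_cm and shorter than the longer of the two segments.
    """
    merged = []
    for start, end in sorted(segments):
        if merged:
            prev_start, prev_end = merged[-1]
            gap = start - prev_end
            longest = max(prev_end - prev_start, end - start)
            if gap <= max_gap_cm and gap < longest:
                # Extend the previous segment to cover the new one.
                merged[-1] = (prev_start, max(prev_end, end))
                continue
        merged.append((start, end))
    return merged


# Hypothetical calls: the first two segments are merged, the third is too far.
print(merge_segments([(0.0, 3.0), (4.0, 6.0), (20.0, 21.0)]))
```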

The statistics W_AB defined by Atzmon et al.20 and L_AB defined by Botigue et al.21 were computed as a summary of IBD sharing. The first is an index of the total length of the shared IBD blocks averaged over the number of possible pairs of individuals (one from population A and the other from population B); the second is an index of the average length of a segment shared IBD between a pair of individuals normalized over the possible number of pairwise comparisons between populations. A block jackknife procedure was used to compute standard errors and confidence intervals for both estimates.
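A minimal Python sketch of the two summary statistics, following the verbal definitions given above rather than the exact formulas of the cited papers, may help fix ideas. The function name `ibd_summary`, the dictionary layout and the example values are hypothetical, and only the between-population case (A different from B) is covered.

```python
def ibd_summary(segments, pop_a, pop_b):
    """Return (W, L) for two distinct populations.

    segments: dict mapping (id_in_A, id_in_B) to a list of shared IBD
              segment lengths in cM (assumed layout).
    W: total shared length averaged over all possible A-B pairs.
    L: per-pair mean segment length, summed and divided by the number
       of possible A-B pairs.
    """
    n_pairs = len(pop_a) * len(pop_b)
    if n_pairs == 0:
        raise ValueError("both populations must be non-empty")
    total_length = 0.0
    sum_of_means = 0.0
    for a in pop_a:
        for b in pop_b:
            lengths = segments.get((a, b), [])
            total_length += sum(lengths)
            if lengths:
                sum_of_means += sum(lengths) / len(lengths)
    return total_length / n_pairs, sum_of_means / n_pairs


# Hypothetical data: only one of the two possible pairs shares IBD segments.
shared = {("a1", "b1"): [2.0, 4.0]}
print(ibd_summary(shared, ["a1", "a2"], ["b1"]))
```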

Isolation-by-distance and genetic boundary tests

The Mantel test22 was applied to the correlation between geographical distance (the shortest distance on a roadmap) expressed in kilometres (km) and genetic similarity measured by the W statistic defined above. Statistical significance was evaluated by computing an empirical P-value based on 10 000 permutations. To verify the presence of genetic boundaries across the Italian regions, we used Monmonier’s algorithm.23 Statistical significance of the genetic boundaries was computed as suggested in Manni et al.23 Briefly, the test is based on analysis of resampled bootstrap matrices: a score is associated with all the different edges that constitute barriers and indicates how many times each edge is included in one of the N boundaries computed to determine an empirical P-value.
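The permutation scheme behind the Mantel test can be sketched in Python as follows. This is an illustrative implementation under stated assumptions, not the software used in the study: the function names are hypothetical, the matrices are assumed to be square and symmetric, the correlation is Pearson's, the P-value is two-sided, and a +1 correction is applied to the empirical P-value.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def upper_triangle(matrix, order):
    """Upper-triangle entries after reordering rows and columns jointly."""
    n = len(order)
    return [matrix[order[i]][order[j]]
            for i in range(n) for j in range(i + 1, n)]


def mantel_test(geo, gen, n_perm=10_000, seed=0):
    """Return (observed correlation, empirical two-sided P-value)."""
    identity = list(range(len(geo)))
    gen_vec = upper_triangle(gen, identity)
    r_obs = pearson(upper_triangle(geo, identity), gen_vec)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        order = identity[:]
        rng.shuffle(order)  # permute the labels of one matrix only
        r = pearson(upper_triangle(geo, order), gen_vec)
        if abs(r) >= abs(r_obs) - 1e-12:
            hits += 1
    return r_obs, (hits + 1) / (n_perm + 1)
```

Permuting rows and columns of one matrix together, rather than shuffling individual entries, preserves the dependence among distances that involve the same region, which is the reason a permutation test is used here instead of a standard correlation test.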

Effective population size

The effective population size (Ne) was inferred for each of the three Italian macroareas (Northern, Central and Southern Italy) and each of the 11 Italian regions separately, taking advantage of the relationship between the number and length of shared IBD segments and Ne as described by Palamara and Pe’er.24 Specifically, we estimated the ‘ancestral’ Ne from the number of short (<2 cM) IBD shared segments only.

Admixture analysis

The unsupervised algorithm implemented in the software ADMIXTURE 1.23 (ref. 25) was used to infer ancestry proportions for all the Italian, European, Middle Eastern and North African samples. Nine ancestral cluster arrangements (K=2,…,10) were tested, and the cross-validation error procedure implemented in the same software was used to define the most realistic contributions of ancestral populations to the currently observed pattern. The procedure was repeated 20 times and the results were averaged. To avoid bias due to differences in sample sizes among populations, this analysis was performed with a balanced sample size (20 individuals per population). Results were further validated using an independent methodology implemented in the software RFMIX.26 See Supplementary Methods for details.

Fst genetic distance

The genetic distance between all population pairs was assessed by computing the Fst (fixation index) measure, as implemented in the R package snpStats.27 Confidence intervals were obtained by means of a block jackknife procedure.
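The block jackknife used for the confidence intervals can be illustrated with a short Python sketch. It shows the generic delete-one-block scheme with equally weighted blocks; the function name `block_jackknife`, the representation of blocks as lists of values and the choice of estimator are assumptions, and the study's actual block definition (for example, contiguous genomic windows) is not specified here.

```python
import math


def block_jackknife(blocks, estimator):
    """Delete-one-block jackknife.

    blocks: list of lists, each holding the observations of one block.
    estimator: function mapping a flat list of observations to a number.
    Returns (mean of leave-one-out estimates, jackknife standard error).
    """
    g = len(blocks)
    if g < 2:
        raise ValueError("at least two blocks are required")
    leave_one_out = []
    for i in range(g):
        kept = [x for j, block in enumerate(blocks) if j != i for x in block]
        leave_one_out.append(estimator(kept))
    mean = sum(leave_one_out) / g
    variance = (g - 1) / g * sum((t - mean) ** 2 for t in leave_one_out)
    return mean, math.sqrt(variance)


def average(values):
    return sum(values) / len(values)


# Hypothetical per-block values; the estimator here is a simple mean.
print(block_jackknife([[1.0], [2.0], [3.0], [4.0]], average))
```

An approximate 95% confidence interval can then be taken as the estimate plus or minus 1.96 standard errors, assuming the estimator is approximately normally distributed.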

Time since admixture events

To estimate the time of admixture events, we used the extension of the ROLLOFF method implemented in the software ALDER.28 For this analysis, the Italian regions were grouped according to the five previously identified clusters (Northern, Central, Southern Italy, Aosta Valley and Sardinia), and admixture was tested against each of the non-Italian populations included in the study. See Supplementary Methods for details.

Results

Italian population substructure

The fineSTRUCTURE coancestry matrix is shown in Figure 2a. The pie charts on the bottom panel show the overlap between the observed clusters and the regions of origin. The sensitivity of the cluster algorithm in assigning each sample to the correct macroarea was 96.43%, 86.55%, 92.00%, 94.44% and 100% for Aosta Valley, Northern Italy, Central Italy, Southern Italy and Sardinia, respectively, whereas the specificity was 98.16%, 96.68%, 94.80%, 99.56% and 100%, respectively. In view of these results, the Italian regions were divided into five groups for the statistical analyses: Northern, Central, Southern Italy, Aosta Valley and Sardinia.
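For reference, per-group sensitivity and specificity of this kind can be computed from the self-reported and inferred assignments as in the following Python sketch. The function name `sensitivity_specificity` and the toy labels are hypothetical; the standard one-versus-rest definitions are assumed.

```python
def sensitivity_specificity(true_labels, predicted_labels, target):
    """One-versus-rest sensitivity and specificity for one group.

    sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).
    """
    tp = fn = fp = tn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == target and predicted == target:
            tp += 1
        elif truth == target:
            fn += 1
        elif predicted == target:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy data: one Northern sample is assigned to the South.
truth = ["North", "North", "North", "South", "South"]
inferred = ["North", "North", "South", "South", "South"]
print(sensitivity_specificity(truth, inferred, "North"))
```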

Figure 2

Heatmap representing the coancestry matrix indicating the number of genomic segments inherited from the same ancestral populations for each pair of samples: dendrogram based on hierarchical clustering (at the top), and pie charts representing the overlap between inferred and self-reported origin of the Italian individuals (a). Principal component analysis based on the same coancestry matrix including Sardinians (b) and excluding Sardinians (c); x and y axes were inverted to emphasize similarity to the geographical map of Italy.

The distribution of the Italian individuals from the first two eigenvectors of the PCA is shown in Figure 2, both including and excluding Sardinians (Figures 2b and c). The PCA results provided evidence of large differences between Sardinians and other Italians, and the presence of a genetic gradient across mainland Italy. By performing a genome-wide scan using the first four eigenvectors as independent outcomes regressed against each SNP used as a predictor, we found that the first four PCs strongly correlated with well-known loci under selective pressure, such as the HLA-A complex (hg19 chr6:g.21 266 925_32 628 428),29 and the locus related to the lactase persistence phenotype (hg19 chr2:g.135 907 088_137 013 606);30 and loci with a low recombination rate such as the hg19 chr8:g.8 094 406_11 860 625 region harbouring a known polymorphic inversion in Europeans31 (see Manhattan plot – Supplementary Figure S1). In view of the above, we repeated the analysis excluding the above loci, obtaining the same pattern of variability across mainland Italy (Supplementary Figure S2): the correlation between the first two PCs computed with and without these loci was >0.95. ADMIXTURE analysis was also performed on the 300 Italians: see Supplementary Material and Supplementary Figure S3 for a description of the results.

Shared IBD haplotypes across Italy

We found greater sharing of IBD segments within regions than between regions from both the W and L statistics (Supplementary Tables S2A and S2B), although in some cases the differences were not statistically significant (data not shown). The Mantel test showed a significant correlation between the geographical distance and the total length of shared IBD segments, both when including Sardinia (R=–0.483, P=0.0039), and when excluding Sardinia from the analysis (R=–0.622, P=0.0027). Based on Monmonier’s algorithm, a genetic barrier was identified between Sardinia and mainland Italy (empirical P<0.0001), whereas there was no evidence of any statistically significant genetic barrier within the peninsula.

A significant southward trend of increasing population size Ne (Supplementary Figure S4A) was found (P for trend<0.0001) when the mainland Italians were grouped according to the three main macroareas: Northern, Central and Southern. From single region results (Supplementary Figure S4B), the lowest estimated ancestral Ne values were for Sardinia and the Aosta Valley (<5000), accompanied by a high rate of inbreeding, whereas the effective population sizes were rather homogeneous across the remaining regions.

Comparison with neighbouring populations

We first used PCA (Supplementary Figure S5) to investigate genetic differences across 35 populations from Europe, the Middle East and North Africa. The projection of the first two eigenvectors reflects well the geographical origins of the subjects included in the analysis (see Supplementary Results for details). We further investigated population structure using ADMIXTURE. Three major components are noticeable in the Italian population, with different proportions among the major Italian macroareas (see Supplementary Results and Supplementary Figure S6 for a more detailed description).

The distribution of the pairwise Fst distances between all population pairs is shown in Supplementary Table S3. The genetic distance between Southern and Northern Italians (Fst=0.0013) is comparable to that between individuals living in different political units (ie, Iberians-Romanians Fst=0.0011; British-French Fst=0.0007) and, interestingly, to that observed in >50% of all the possible pairwise comparisons within Europe (Supplementary Figure S7).

Finally, when comparing IBD segment sharing between Italians and the other populations, both with total length W and the average length L, we observed that Southern Italians share more IBD with North African and Middle Eastern populations, whereas Northern Italians share more IBD with Europeans, as is predictable by geography (see Supplementary Results and Supplementary Table S4 for details).

Time since admixture events

We estimated time since admixture events from LD decay as a function of genetic distance.

We found evidence of the presence of a mix of Central-Northern European and Middle Eastern-North African ancestries in the Italian individuals (Supplementary Table S5). The estimated times of admixture ranged between ~2050 and 1300 years ago (y.a.), with an average of about 1650 y.a. – assuming 29 years per generation32– for Northern Italians, and between ~3000 and 1450 y.a. (~2100 y.a. on average) for Central Italians. Finally, for the Southern Italian individuals, admixture between European and Northern African-Middle Eastern ancestry was estimated to have occurred about 1000 y.a. (see Supplementary Table S5 and Supplementary Results for a complete report of significant results).
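Admixture dates inferred from LD decay are expressed in generations and converted to years using the assumed generation time. The Python sketch below shows this arithmetic; the function names are hypothetical, and the only value taken from the text is the 29 years per generation.

```python
YEARS_PER_GENERATION = 29  # generation time assumed in the text


def generations_to_years(generations, years_per_generation=YEARS_PER_GENERATION):
    """Convert an admixture time in generations to years ago."""
    return generations * years_per_generation


def years_to_generations(years, years_per_generation=YEARS_PER_GENERATION):
    """Convert an admixture time in years ago back to generations."""
    return years / years_per_generation


# An estimate of about 1650 y.a. corresponds to roughly 57 generations.
print(round(years_to_generations(1650), 1))
```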

Discussion

We evaluated fine-scale genetic differences within the Italian population using a large set of SNPs genotyped in more than 300 Italian individuals and compared them with published genotype data for 1272 European, Middle Eastern and North African individuals. Our study focused on only 11 of the 20 Italian regions and, in particular, lacks representation from the Eastern part of Italy. On the other hand, some of the strengths of this study are the sample selection criteria based on typical surnames and the place of birth of the four grandparents, which avoids the inclusion of individuals whose origins are different from their place of birth. In a previous study,1 we provided a first overview of the genetic composition of Italians, selecting individuals based on the place of birth only, but we were not able to discriminate between Northern and Central Italians. We observed that a proportion of individuals born in Northern Italy clustered with Southern Italians. This was explained by the internal migration that occurred during the last two generations, when people from Southern Italy left their place of origin looking for better economic opportunities in the North.

Genetic gradient across mainland Italy

Several of our analyses revealed that the genetic structure of Italians varies to a large extent but that it can be used to assign each mainland Italian to the correct macroarea of origin with very good sensitivity and specificity. The few misclassified individuals could be due to incorrect self-reported origin of the mother or of grandparents from the maternal line (sampling according to typical surnames makes the paternal line more likely to be correct), or to different origins of the great-grandparents, about whom we have no information. The PCA, IBD and ancestry analyses revealed a genetic gradient across the peninsula that correlates with geography. As the diversity gradient in Italy remained after excluding loci under selective forces or with low recombination rate, we may speculate that historical events have a complementary role in explaining the great genomic variability within the Italian population. The Fst genetic distance between Northern and Southern Italians is comparable to, or even higher than, the differences observed among individuals living in different countries, further confirming the high genomic variability within Italy. We also replicated the previously described gradient of SNP allelic frequencies on the hg19 chr2:g.135 907 088_137 013 606 locus,30 related to the lactase persistence phenotype, which strongly correlates with the second PC, and which in turn reflects the geographical location.

Continuous gene flow or different ancestral populations

We hypothesize two simple historical scenarios leading to the observed genetic variability across Italy: (a) continuous ancient gene flow amplified by isolation-by-distance in recent times; (b) different ancestral origins of the main Italian macroareas whose distinguishability has been attenuated by genetic exchange in recent times.

Monmonier’s algorithm revealed no evidence of the presence of genetic barriers across the peninsula. Instead, results from the Mantel test provide evidence of a correlation between genetics and geographical distance. The observed higher average length of the segments with shared IBD within regions compared with those shared between regions (Supplementary Table S2B) suggests recent isolation-by-distance across the wide range of latitude of the Italian peninsula. Moreover, a North to South gradient of increasing ancestral Ne was inferred for the three main macroareas (Northern, Central and Southern), coinciding with increased heterozygosity in Southern Italy. A similar trend was previously described for the rate of inbreeding and genome-wide similarity across Central Europe,33 and could be interpreted as a signature of the ‘Out of Africa’ migration during the Palaeolithic, of expansions from refugia after the ice age and of ancient South-to-North migratory waves that occurred at the time of the European colonization by Neolithic farmers. The ancestry and IBD analyses provided evidence of admixture in Italy with three major ancestries detected, most represented in Northern Europeans, Southern Europeans and Middle Easterners, respectively (with a small percentage of a North African component found in Southern Italy and Sardinia), with different prevalence across the peninsula. None of these components is fixed in any population, meaning that there is a poor fit with a strict admixture model, as assumed by the algorithm used, and supporting a process of continuous gene flow in multiple directions (migratory waves to and from Italy).

According to previous studies on the Y chromosome and mtDNA,34, 35 the Middle Eastern ancestry in Southern Italians most likely originated at the time of the Greek colonization and, with a smaller percentage, of the subsequent Arabic domination,7 whereas in Central-Northern Italy it is possibly because of the admixture of the indigenous residents with Middle Eastern populations spreading from the Caucasus to Central Europe.19, 21, 28, 36 Our results agree with previously published reports describing a possible maritime route of colonization across Europe, including Italy,37 although we cannot exclude the occurrence of more recent demographic events leading to a similar scenario. Finally, the homogeneous ancestral effective population size across Italian regions could be interpreted as reflecting common genetic origins, also taking into account previous considerations, although the same results might also occur in comparing populations without common origins.

Our study supports the notion that genetic variability across Italy is likely to represent continuous gene flow leading to differences in the proportion of ancestry from different sources, along with genetic exchange among neighbouring populations (eg, Northern Italian with European countries, Southern Italian with Middle Eastern and North African ones). Previous studies, analysing uniparental markers, found Y-chromosome genetic discontinuity across Italy. This contrasts with a general lack of structure for mitochondrial DNA,2, 4 and with a higher homogeneity for maternal than paternal genetic contributions, suggesting different demographic and historical dynamics for females and males in Italy.

Sardinia

Among the 11 Italian regions investigated, Sardinia deserves a separate discussion. We replicated previously reported results showing the large genetic differences between Sardinians and mainland Italians;29 the occurrence of a genetic barrier between Sardinia and the rest of Italy; the previously described similarity between the ancestries of the Tyrolean iceman and Sardinians;36 and the very large allelic frequency differences between the islanders and the mainland Italians for the SNPs located on chromosome 6, in the HLA-A complex locus involved in the immune response,38 which may at least partly explain the increased prevalence in the island of immune and autoimmune diseases, such as type 1 diabetes and multiple sclerosis.

The Aosta Valley region

The Aosta Valley is a small region at the North-Western Italian border with France and Switzerland. Its inhabitants showed interesting genetic characteristics. In the PCA (Figures 2b and c and Supplementary Figure S5), Aostans do not cluster with subjects from the other Northern Italian regions, not even after the inclusion of non-Italian populations in the analysis. Moreover, IBD analysis revealed a high level of inbreeding, comparable to the rate observed in Sardinia. However, the estimated proportions of ancestry are comparable to those in Piedmont, Liguria, Lombardy and Emilia Romagna. Hence, our results suggest that the observed differences are not due to long genetic isolation, as is the case for Sardinians, but are the result of recent isolation of the Valley. These results are consistent with the low number of different surnames, mostly of French origin, compared with the number of families, as expected in genetic isolates.39

Time since admixture

The overall procedure to estimate time since admixture with the ALDER software is strongly conservative and is based on the assumption of the simplified model of admixture we used – that is, a single pulse from discrete sources. Our previously discussed results favoured the hypothesis of continuous gene flow across Italy with admixture events that have likely occurred multiple times, and it should be noted that the method used is designed to emphasize the most recent admixture event.28 Our estimated admixture dates agree with recent literature and several historical events. For example, our results support the hypothesis of an admixture event that occurred about 3000 y.a. involving populations coming from the Caucasus, the Middle East and populations that lived in Central Italy (Tuscany and Latium), as previously reported from analyses of mitochondrial and Y-chromosome DNA,40 and genome-wide data.41 This was interpreted as possible evidence of the Middle Eastern (Anatolian) origin of the Etruscans (Herodotus’ theory). Admixture events introducing Northern-Central European ancestry into Italy were estimated to have occurred during the so-called ‘Migration Period’ after the Roman Empire collapsed (476 CE), with the consequent decline in population. After that the ‘Barbarian invasions’ took place, with migratory waves from Northern-Central Europe to Northern-Central Italy. It may be speculated that the estimated Northern-Central European ancestry in contemporary Italians is also the effect of subsequent Italian population growth, as previously reported by studies on mitochondrial and Y-chromosome DNA from a genetic isolate in Northern Italy,42 suggesting that Germanics (Lombards in particular) settled in Northern Italy during the ‘Migration Period’ and may have contributed to the foundation of some communities in Northern Italy.

Finally, admixture events involving the Southern Italian population were inferred to have occurred about 1000 y.a., coinciding with the Norman conquest of Southern Italy that spanned most of the eleventh and twelfth centuries and involved many battles and independent conquerors. A much more detailed analysis of geographically well-distributed samples from Southern Italy is required to validate our findings, while a large genetic contribution to the island of Sicily from Greece has previously been estimated.43

Conclusions and future directions

To our knowledge this is the first study to investigate genetic variability within the Italian population using a very large number of SNPs and subjects with well-defined geographical origin. We used these data to make inferences on population substructure and admixture events in Italy. To achieve a more complete picture of Italian genetic history and composition, future work should include Italian regions not covered in this study (Eastern regions such as Apulia and Veneto) and should compare Italians with other European populations (Greeks in particular). Our study could be useful for further genetic, epidemiological and forensic studies in Italy, as it may provide a set of valuable healthy controls for genome-wide association studies, and may be useful for identifying ancestral informative markers. It might also help to explain the North-South prevalence gradient reported in Italy for several types of tumours.44