Introduction

The use of population isolates has proven valuable for mapping loci underlying complex traits (eg, Holm et al,1 Sulem et al2 and Thorgeirsson et al3). Genetic isolates present key features that simplify gene mapping, namely reduced phenotypic and environmental variance, and reduced genetic heterogeneity.4, 5 Genomes of individuals from isolated populations tend to be more homogeneous than those from other populations, as reflected by a small effective population size (Ne, the effective number of individuals required to explain the observed genetic variability).6 In population isolates, a small Ne may arise as a consequence of a founding event (ie, the settlement of a new territory) and is maintained through time owing to the absence of gene flow (migration) with neighbouring populations. In this scenario genetic drift (the random fluctuation of allele frequencies at each generation) can lead to a rapid and substantial reduction of extant variability, and the frequency of disease- or trait-associated variants can increase because of drift, thus facilitating gene mapping.7

Another key property of population isolates is the large extent of regions in linkage disequilibrium (LD).8 Isolates are relatively young compared with the population of origin, and usually originated from a small founding nucleus of individuals, two conditions that create association between loci that are far apart from each other. In addition, because of the small Ne, recombination often takes place between identical haplotypes, further increasing the range of significant LD. As a consequence, any two individuals in the population tend to share potentially long chromosomal segments identical by descent, facilitating long-range haplotype matching, genotype imputation9, 10, 11 and reconstruction of population-specific recombination maps.

The Val Borbera is a geographically isolated valley within the Apennine Mountains of Piedmont (North-West Italy). According to genealogical records, about 3000 individuals (the majority of the current population) descend from inhabitants of the valley in the 17th century. Previous demographic and epidemiological analyses highlight features characteristic of genetic isolates, including a high percentage (>80%) of marriages between individuals within the valley in the last four centuries and family clustering for some traits of medical interest.12 However, the extent to which the Val Borbera population is a genetic isolate is unknown. Furthermore, it is not clear to what extent the valley’s seven villages can be considered as a single population or whether they form distinct units of a meta-population. In this study we used nuclear and mitochondrial DNA (mtDNA) data to explore the extent of genetic variation of the valley and to investigate population structure. Implications of the results for gene-mapping studies are discussed.

Subjects and methods

Samples

A total of 1800 healthy individuals, spanning 18–102 years of age, gave informed consent to participate in genetic analyses. Birth, marriage and death records from the 16th century onwards have been collected and used to reconstruct pedigrees from which a pedigree-based kinship coefficient was calculated.12, 13 Data collection and genotyping of the cohort were approved by the institutional ethical committee of the San Raffaele Hospital in Milan and by the Regione Piemonte.

We used kinship information to exclude all individuals related as first cousins or closer, using a custom algorithm that implements recursive removal on the basis of kinship information. We felt confident in using pedigree kinship as it has been shown to be highly correlated with the genomic one.13 After removal of close relatives we classified individuals according to two criteria: (i) all four grandparents were resident in any of the villages in the valley; and (ii) all four grandparents were resident in the same village. As detailed in Table 1, the first criterion allowed us to select 267 individuals to form the ‘valley’ sample, while according to the second criterion we selected single-village samples. One of the seven villages (ROC) was excluded from the analyses because of a very small sample size (Table 1).
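The custom removal algorithm is not described in detail, so the following is only a minimal sketch of one plausible greedy scheme: repeatedly drop the individual with the most remaining close relatives until no pair reaches the threshold. The function name `prune_relatives`, the dictionary layout of the kinship input and the use of 0.0625 (the expected kinship coefficient of first cousins) as the cut-off are assumptions made for illustration, not the authors' implementation.

```python
from itertools import combinations

FIRST_COUSIN_KINSHIP = 0.0625  # expected kinship coefficient of first cousins (assumed cut-off)


def prune_relatives(ids, kinship, threshold=FIRST_COUSIN_KINSHIP):
    """Greedily drop individuals until no remaining pair has kinship >= threshold.

    ids: iterable of individual identifiers.
    kinship: dict mapping frozenset({a, b}) -> pedigree kinship coefficient;
             pairs missing from the dict are treated as unrelated (kinship 0).
    """
    kept = set(ids)
    while True:
        # Count, for every kept individual, how many close relatives remain.
        degree = {i: 0 for i in kept}
        for a, b in combinations(sorted(kept), 2):
            if kinship.get(frozenset((a, b)), 0.0) >= threshold:
                degree[a] += 1
                degree[b] += 1
        # Ties are broken by identifier order so the result is deterministic.
        worst = max(sorted(degree), key=degree.get, default=None)
        if worst is None or degree[worst] == 0:
            return kept
        kept.remove(worst)  # remove the most connected individual, then repeat


example_kinship = {
    frozenset(("A", "B")): 0.25,    # parent-offspring or full siblings
    frozenset(("A", "C")): 0.0625,  # first cousins
    frozenset(("C", "D")): 0.01,    # distant relatives, below the cut-off
}
unrelated = prune_relatives(["A", "B", "C", "D"], example_kinship)
```

In this toy example individual A is related to both B and C, so removing A alone leaves a set with no pair at or above the cut-off.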

Table 1 Populations in this study

For comparison we added to our analyses genetic data from other reference populations (Table 1). We downloaded nuclear genotype data for three populations in the HapMap collection (The International HapMap Project, Release 27, NCBI build 36). Two samples are of European origin, namely CEU (Utah residents with Northern and Western European ancestry) and TSI (Tuscans). The third sample is YRI (Yoruba) from Nigeria, Africa. We removed from CEU the samples presenting cryptic relatedness, as previously described.14 Finally, we added a fourth reference population, consisting of a cohort from the Veneto Region (North-East Italy) with no apparent history of geographical isolation.15

For the mtDNA analyses, we referred to two Piedmontese populations in close geographical proximity to the Val Borbera (Table 1), namely Trino Vercellese and Val di Susa, whose 76 mtDNA control-region data are reported here for the first time. Finally, we also included published mtDNA control-region data from the Saami, a northern population known to be a genetic outlier among Europeans.16

Analyses of nuclear data

Data sets of genotypic calls at single-nucleotide polymorphisms (SNPs) were available both for the valley study cohort, which was genotyped with the Illumina (San Diego, CA, USA) 370k-Quad CNV array, and for the genomic reference populations.17, 18 All the data sets were filtered to retain variants that satisfied the following criteria: (a) minor allele frequency (MAF) ≥0.01; (b) genotype call rate >97% for markers with MAF above 5% and genotype call rate >99% when 1%<MAF<5%; (c) Hardy–Weinberg equilibrium (HWE) P-value>0.00001. In the Valley sample HWE was calculated in a subset of individuals with probability of identity by descent >0.185. Merging of all cohorts yielded 168 542 overlapping genome-wide SNPs that were used for subsequent analyses.
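The three marker filters can be expressed as a single per-SNP predicate. The sketch below is illustrative only: the function name `passes_qc` is hypothetical, and because the text does not say how a MAF of exactly 5% is treated, the stricter 99% call-rate threshold is assumed for that boundary case.

```python
def passes_qc(maf, call_rate, hwe_p):
    """Return True if a SNP satisfies the three filters described in the text."""
    if maf < 0.01:        # (a) minor allele frequency of at least 1%
        return False
    if hwe_p <= 0.00001:  # (c) Hardy-Weinberg P-value must exceed 1e-5
        return False
    # (b) the call-rate threshold depends on the frequency class;
    # MAF exactly 0.05 falls in the stricter class here (an assumption).
    min_call_rate = 0.97 if maf > 0.05 else 0.99
    return call_rate > min_call_rate
```

For example, a SNP with MAF 0.03 and call rate 98% is rejected, whereas the same call rate is accepted for a SNP with MAF 0.20.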

Pairwise genetic distance

Allele frequency differentiation (FST) between pairs of populations was calculated at each locus as FST = σ²/[π(1−π)], where π is the mean allele frequency across the two populations and σ² is the variance of the allele frequency.19 Allele frequencies were estimated by allele counting in the valley sample.
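As a worked illustration of the per-locus formula, the sketch below computes FST for a pair of populations from their allele frequencies. It assumes an unweighted mean and the population (not sample) variance of the two frequencies; the text does not specify either choice, and the function name `fst_locus` is hypothetical.

```python
def fst_locus(p1, p2):
    """Per-locus FST for two populations: variance / (mean * (1 - mean))."""
    pi = (p1 + p2) / 2.0  # mean allele frequency
    if pi <= 0.0 or pi >= 1.0:
        return float("nan")  # locus is monomorphic in both populations
    # Population variance of the two frequencies (equals (p1 - p2)^2 / 4).
    var = ((p1 - pi) ** 2 + (p2 - pi) ** 2) / 2.0
    return var / (pi * (1.0 - pi))


example_fst = fst_locus(0.2, 0.4)  # pi = 0.3, variance = 0.01
```

With frequencies of 0.2 and 0.4 the mean is 0.3 and the variance 0.01, giving FST = 0.01/0.21, or about 0.048.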

Analysis of population structure

Population structure analysis was performed by means of Principal Component Analysis (PCA)20 and genetic clustering.21 As both methods assume markers to be independent, we pruned from the genome-wide SNP set all SNPs in high LD (defined by r²≥0.4 in the valley) using MASEL,22 leaving 25 696 SNPs. This number of SNPs is sufficient for PCA to detect structure at the observed level of differentiation. Indeed, even the smallest FST (0.007 between VER and TSI, Table 2) is almost one order of magnitude above the threshold value of 0.001, which we calculated as 1/sqrt(25 696 × 40), where 25 696 is the number of markers and 40 is the minimum number of chromosomes considered.20
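The threshold quoted above is simple arithmetic and can be checked directly. The helper name `pca_fst_threshold` is hypothetical; the numbers are those given in the text.

```python
import math


def pca_fst_threshold(n_markers, n_chromosomes):
    """Detection threshold computed as 1 / sqrt(markers * chromosomes)."""
    return 1.0 / math.sqrt(n_markers * n_chromosomes)


threshold = pca_fst_threshold(25696, 40)  # close to 0.001
ratio = 0.007 / threshold                 # smallest observed FST vs threshold
```

The threshold evaluates to just under 0.001, so the smallest observed FST of 0.007 exceeds it roughly sevenfold.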

Table 2 Pairwise genetic distance (FST) between populations (average of all nuclear loci)

We used PCA to summarise SNP genotype information at the level of each individual, with the aim of exploring the relationships between individuals within populations and between populations. PCA was performed using Eigenstrat.23 For each component we calculated formal P-values for the presence of population substructure according to the Tracy–Widom (TW) distribution (with β=1) as described in Patterson et al.20 In order to avoid bias owing to unequal population sizes,24 we randomly sampled 88 and 20 individuals from each population when considering the valley and the villages, respectively. The only exceptions were ROA and ALB, for which only 13 and 19 individuals, respectively, were available.

Model-based clustering analysis tests for the presence of different clusters (K) in a meta-population. We applied unsupervised (ie, without prior information) clustering analysis to the whole-sample set, exploring the hypotheses of K=1 to 10 clusters using ADMIXTURE.21 Cross-validation errors for each hypothesis were calculated as described in Alexander et al.21

Run of homozygosity (ROH)

For the ROH analysis we randomly sampled 84 and 36 individuals from each population when considering the valley and the villages, respectively. Similar sample sizes were used for the other reference populations. Genotypic data were analysed with the PLINK package version 1.07 (ref. 25) under default settings (ie, sliding windows of 5 Mb, minimum 50 SNPs, one heterozygous genotype and five missing calls allowed). Each SNP is considered to be part of a homozygous segment when the proportion of overlapping homozygous windows is above 5%. ROHs were defined as stretches of at least 0.5 Mb with at least 25 homozygous SNPs (maximum pairwise distance=100 kb).
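To make the final ROH definition concrete, the sketch below scans one individual's calls along a chromosome and reports runs that meet the three stated criteria (at least 25 SNPs, at least 0.5 Mb, no gap above 100 kb). It is a deliberate simplification and not PLINK's algorithm: the sliding-window step and the allowance for heterozygous or missing calls are ignored, and the function name `find_rohs` is hypothetical.

```python
def find_rohs(positions, homozygous, min_snps=25,
              min_length_bp=500_000, max_gap_bp=100_000):
    """Return (start_bp, end_bp) for each run of consecutive homozygous SNPs.

    positions: sorted base-pair positions of SNPs on one chromosome.
    homozygous: list of booleans, True where the genotype call is homozygous.
    """
    rohs = []

    def close_run(first, last):
        # Keep the run only if it satisfies the SNP-count and length criteria.
        if first is None:
            return
        n_snps = last - first + 1
        length = positions[last] - positions[first]
        if n_snps >= min_snps and length >= min_length_bp:
            rohs.append((positions[first], positions[last]))

    start = None  # index of the first SNP in the current run
    for i, (pos, hom) in enumerate(zip(positions, homozygous)):
        if hom and start is not None and pos - positions[i - 1] <= max_gap_bp:
            continue  # the current run extends to this SNP
        close_run(start, i - 1)       # run ended (heterozygote or large gap)
        start = i if hom else None    # possibly begin a new run here
    close_run(start, len(positions) - 1)
    return rohs


snp_positions = [i * 20_000 for i in range(30)]  # 30 SNPs, 20 kb apart
all_hom = find_rohs(snp_positions, [True] * 30)
calls_with_het = [True] * 30
calls_with_het[10] = False
broken = find_rohs(snp_positions, calls_with_het)
```

In the toy data, 30 homozygous SNPs spanning 580 kb form one ROH, whereas a single heterozygous call at the eleventh SNP splits the stretch into two runs that are both too short to qualify.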

LD calculation and estimate of effective population size from nuclear data

Pairwise LD was calculated as the squared correlation (r²) in genotype frequencies between 49 353 autosomal SNPs from six randomly chosen chromosomes (1, 3, 7, 10, 18 and 22) using PLINK.25 For all populations we estimated Ne from LD.26, 27, 28 Indeed, the expected LD value at a certain recombination distance (c) decreases with increasing Ne and with increasing c, and thus it is possible to derive Ne from LD values provided that the recombination distance between the loci is known. Furthermore, the recombination distance between markers is inversely proportional to the number of generations (t) through which the markers have been inherited together, according to the formula t≈1/(2c),28 and thus estimates of Ne at different times can be obtained by considering different classes of recombination distance. One limitation of this approach is that the range of recombination intervals that can be taken into account depends on the sample size, as within bins r² is adjusted for the size of the sample used to calculate LD (adjusted r²=r²−1/n, where n is the sample size). Therefore, meaningless negative estimates of Ne are produced when r² is lower than 1/n.26
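The exact LD expectation used is not spelled out in the text, so the sketch below assumes the commonly used approximation E[r²] ≈ 1/(1 + 4·Ne·c), with c in Morgans, combined with the sample-size adjustment r² − 1/n described above. Both the choice of this expectation and the function name `ne_from_ld` are assumptions for illustration; the cited methods may use a different constant in the denominator.

```python
def ne_from_ld(mean_r2, c_morgans, n):
    """Estimate Ne from the mean r^2 of a recombination-distance bin.

    Assumes E[r^2] ~ 1 / (1 + 4 * Ne * c), with c in Morgans.
    Returns None when the adjusted r^2 is not positive, since the
    resulting estimate would be meaningless.
    """
    r2_adj = mean_r2 - 1.0 / n  # sample-size adjustment from the text
    if r2_adj <= 0.0:
        return None
    return (1.0 / r2_adj - 1.0) / (4.0 * c_morgans)


example_ne = ne_from_ld(mean_r2=0.06, c_morgans=0.001, n=100)
too_small = ne_from_ld(mean_r2=0.005, c_morgans=0.001, n=100)
```

With a mean r² of 0.06 at 0.1 cM in a sample of size 100, the adjusted r² is 0.05 and the estimate is (1/0.05 − 1)/0.004 = 4750; a mean r² of 0.005 falls below 1/n and yields no estimate.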

In all populations, pairwise LD values for markers separated by genetic distances between 0.0625 and 0.35 cM were binned into distance categories and the average r² of each bin was considered. This range of genetic distances offers a view of the period from 20 000 to 3500 years ago (y.a.), assuming a generation time of 25 years.29 For populations in the valley, because of the higher levels of LD over longer recombination distances, it was possible to extend the calculations to distances between 0.0125 and 1.25 cM, providing nuclear effective population size estimates (nNe) up to 1000 y.a. Confidence intervals around the estimates were derived considering chromosomes as replicates.
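The time window quoted above follows from t≈1/(2c) with c expressed in Morgans and a 25-year generation time. A minimal check of that arithmetic (the function name `years_ago` is hypothetical):

```python
def years_ago(c_centimorgans, generation_years=25):
    """Approximate age, in years, probed by markers c centimorgans apart."""
    c_morgans = c_centimorgans / 100.0
    generations = 1.0 / (2.0 * c_morgans)  # t ~ 1 / (2c)
    return generations * generation_years


oldest = years_ago(0.0625)        # 800 generations
youngest = years_ago(0.35)        # about 143 generations
valley_youngest = years_ago(1.25) # 40 generations
```

These evaluate to about 20 000, 3570 and 1000 years, respectively, matching the bounds given in the text.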

Analyses of mtDNA sequence data

We sequenced and analysed 360 bp (HVS-I, from np 16 024–16 383) of the mtDNA-control region (Supplementary Table 1). Sequencing was performed as previously reported.30 Within each village we randomly selected 40 individuals for which the mtDNA sequence was available or imputable using the pedigree information. Indeed, we exploited pedigree data to infer mtDNA sequences within matrilineal pedigree segments of depth of up to five generations. We are aware that this approach ignores very recent mutations. However, the loss of information is negligible as all the villages have low effective population size and the number of generations in which we assumed no mutations occurred is never higher than five.

Estimate of effective population size from mtDNA

Changes of the mtDNA effective population size (mtNe) through time were reconstructed using the extended Bayesian skyline plot (EBSP) as implemented in the BEAST software v.1.6.2 (ref. 31) and the Hasegawa, Kishino and Yano model of nucleotide substitution.32 EBSP is a non-parametric Bayesian coalescent approach that makes no assumption about the demographic model of the population.31 Each coalescent interval has its own prior mtNe distribution, which is sampled during the Markov chain Monte Carlo (MCMC) run, together with the coalescent tree, the branch lengths and the evolutionary parameters.33 After removal of the burn-in, mtNe is evaluated at specified time points on the recorded iterations of the MCMC and then interpolated to obtain its variation through time. The length of the MCMC was set to 20 000 000 iterations with a 10% burn-in and a thinning interval of 1000 to ensure that all parameters had an effective sample size above 200. The mutation rate was set to 1.3 × 10⁻⁷ (roughly equivalent to that in Forster et al34) and the generation time to 25 years.29 To check for convergence, each analysis was run at least twice. Input files for BEAST are available upon request.

Results

Population clustering

We calculated genetic distances between pairs of populations at all available nuclear loci (Table 2). The results show comparable genetic distances among the populations of European ancestry, with apparently slightly higher structuring within the valley (Table 2).

We used nuclear data to perform analyses at the individual level using both villages and valley samples. Figure 1 shows a plot of the first two principal components from randomly selected samples of equal size and after LD corrections. When considering single villages (Figure 1b), the first two components, which explain less than 10% of the genetic variance, show significant discrimination of populations (TW P-value <0.001, Supplementary Table 2), the first one separating African from non-African populations, and the second distinguishing the valley from the other European populations. The same pattern is observed when considering the valley as a whole (Figure 1a), but only the first component is significant in this case. Model-based clustering analysis of the same data set (Figure 1c and d, Supplementary Figures 2–5) revealed K=4 as the most likely number of clusters in both cases (valley and villages, Supplementary Figure 1). Graphical representation of the proportion of ancestry in each cluster for each individual (Figure 1c and d) shows how the three main components distinguish Africans, Europeans and the valley. The fourth component, which distinguishes two clusters in the MON village (Figure 1d), does not seem relevant for the valley analysis, although K=4 is the most probable number of clusters there as well (Supplementary Figure 1).

Figure 1

Population clustering analyses. Model-free (a, b) and model-based (c, d) genetic structure analyses of the valley (a, c) and the villages (b, d) samples, with respect to reference populations (see Table 1 for a list of abbreviations). In panels (a) and (b), percentages within brackets on the two axes indicate the explained variance, while P-values indicate each component’s statistical significance. The analyses at the valley and village levels show similar patterns, and both clustering methods suggest limited recent genetic exchange between the valley and the other Italian and European populations.

Long segments of autozygosity and shared haplotypes within villages with respect to other populations

ROHs are stretches of consecutive homozygous genotypic calls at adjacent SNP loci in an individual’s genome. The extent of ROHs in a genome provides a good estimate of its autozygosity at both individual and population levels. Frequent (10–13% of the genome) ROHs of short length (less than 100 kb) and less frequent ROHs of moderate length (up to 4 Mb) are expected to be found in individuals from outbred populations.35, 36, 37 Longer ROHs provide evidence for past consanguinity and population isolation.37, 38 Figure 2 presents the distribution of ROHs in the different populations according to their size (in Mb). For both the villages and the valley, the distribution of ROHs appears to be less left-skewed than in other populations, suggesting a higher proportion of individuals with extended regions of autozygosity.

Figure 2

Statistics on the extent of the ROHs in the valley (a) and villages (b) with respect to other populations. ROHs were binned according to length, and for each population the percentage of individuals having at least one ROH of a given length is indicated on the y axis. As the length of the ROHs increases, different trends are visible for the isolate and the reference populations. The villages behave similarly to one another (b). See Table 1 for a list of abbreviations.

As a further indication of genetic isolation, we estimated the decay of LD according to recombination distance between markers. As Figure 3 shows, villages harbour the highest levels of LD compared with other populations, even for long recombination distances, similarly to other isolates.39 Interestingly, the valley sample shows an opposite trend compared with the villages, indicating that shared haplotypes tend to be longer ‘within’ villages than ‘among’ villages.

Figure 3

Decay of LD with increasing recombination distance, measured as average r² within recombination distance bins. As expected, the population with the fastest decay is YRI, whereas the villages show the slowest decay. The trend is inverted for the valley meta-population (dark blue), whose decay is even faster than that of YRI at recombination distances >0.15 cM.

No traces of recent expansion in populations from the villages

Using nuclear data we estimated nNe from LD. The presence of different recombination distance classes allowed us to obtain estimates at different times in a window of 20 000–1000 years ago. As Figure 4a shows, effective population size in the villages is generally lower than in other populations, and never exceeded 5000 individuals. Estimates for reference populations are consistent with previous analyses using the same method,26 and the trends reflect known demographic events:40, 41 a recent expansion for non-African populations and an almost constant size for the African one. A similar trend is observed in the valley meta-population, whereas no signs of recent expansion can be seen in single-village samples. On the contrary, a decline in nNe appears to have taken place from 4000 years ago onward (Supplementary Figure 6). To further clarify this point we calculated the average nNe before and after 4000 years ago for the villages and the valley. As Supplementary Figure 7 shows, opposite trends took place in the villages and in the valley.

Figure 4

Estimates of effective population sizes. Upper graph (a) shows estimates from nuclear DNA in the last 20 000 years. Notably, villages have the lowest values of Ne and there is no signature of the recent population expansion as in the reference populations. An opposite trend is visible for the valley sample (dark blue). Finally, YRI shows a constant trend and the highest effective population size, in agreement with other estimates as mentioned in the main text. A similar trend is shown when considering estimates from mtDNA (b) data. However, given the different properties of mtDNA and the different methods we used with respect to nuclear DNA, the two estimates are not directly comparable.

We also estimated the villages’ mtNe values from mtDNA (Figure 4b) and compared them with two non-isolated nearby populations of Piedmont (TRV and VDS). Two main features emerged from this comparison. First, the modern effective population size in the villages is generally lower than in other populations, never exceeding 10 000. In contrast with nuclear estimates, there is higher variance among villages. The lowest mtNe is found in CAR, where mtNe is slightly above 2000, about three times smaller than in the Saami (a traditionally isolated group16). The second feature revealed by the EBSP analysis is a constant demography for the villages, again in accordance with nuclear estimates. Conversely, TRV and VDS show an increase of mtNe in the Upper Palaeolithic/Neolithic, similarly to other European populations.42 Surprisingly, ROA shows a different behaviour with respect to the other villages. We believe this is owing to stochasticity in the reconstructed coalescent processes.

Discussion

This study shows, on the basis of several lines of evidence, that the population of the Val Borbera is a genetic isolate. First, allele frequencies summarised by PCA do not match those of other geographically close European and Italian populations. Our comparative analyses showed that, at the nuclear level, samples from the valley form a separate cluster from other European populations, including a northern Italian one. This indicates consistent differences in allele frequency distributions, and points to the occurrence of limited recent gene flow between them. Secondly, we observed extended regions of autozygosity with respect to other populations. In agreement with a previous study on the valley population,12 this feature indicates an excess of shared recent ancestry, suggesting that mating among recently related individuals has taken place in past generations, a condition most likely to occur during genetic isolation. Using both nuclear and mitochondrial markers we estimated a very small effective population size for the villages, suggesting a possible effect of genetic drift in reducing genetic variation within villages. Estimates of mtNe were overall greater than nuclear ones. This is somewhat counterintuitive given that mtDNA is haploid and maternally transmitted and thus should in principle be more prone to genetic drift. However, a direct comparison of the nNe and mtNe estimates is not possible as they have been produced with two different methods. Estimates of mtNe depend on knowledge of mitochondrial mutation rates, and the confidence intervals of our EBSP analyses are quite large. Similarly, computing nNe from LD relies on simplifying assumptions.26 However, we are interested in the trend of population size changes through time rather than in their exact values, and thus we can be confident about our conclusions regarding relative changes. Finally, contrary to other European populations, we observed a recent decline in effective population size, suggesting either that the isolation is still ongoing or that the consequences of past isolation are still present in the nuclear genome of the sampled individuals.

The second main finding of our study is that slight structuring is present among villages within the valley. Although clustering analysis of the villages showed no significant stratification (P-value >0.05 for both first and second principal components, Supplementary Figure 8), FST values indicate some extent of structuring, which has already been observed in isolates,43, 44 even for populations with recent shared genealogy.45 We speculate that the slight observed stratification can be related to the high proportion of marriages occurring between inhabitants of the same village, as demonstrated by analysis of marriage acts and surnames (data not shown). Further, we observe a more rapid decay of LD in the valley with respect to the villages and an opposite trend in LD-based estimates of nNe, consistent with meta-population dynamics.46 Indeed, theoretical and simulation studies46, 47, 48, 49, 50 have demonstrated that the genealogy of lineages sampled from a deme belonging to a meta-population displays a shift in the site frequency spectrum towards more intermediate-frequency variants and an increase in LD compared with an unstructured population. This shift is much less pronounced when pooling lineages from several demes. This observation indicates that the valley is not a single panmictic unit but rather behaves as a meta-population. This finding is crucial for future gene-mapping studies, as it might help define the unit of sampling.

Our study demonstrates that isolation took place in the valley and provides insights for further gene-mapping studies. The Val Borbera population genetic and phenotypic data have been successfully used in genome-wide association meta-analyses,51, 52, 53 the first step in the identification of genes underlying complex traits, in which rare variants are difficult to identify. Isolates provide a unique opportunity to overcome this issue, since the frequency of rare variants might be shifted towards higher values. We have demonstrated that genetic drift has had a large impact on the Val Borbera population, and thus we expect many variants that are rare in the general population (some of which might be of medical interest) to reach appreciable frequencies in the valley. Further, the slight structuring observed might in principle allow a finer analysis of rare variants at the level of villages.

Overall, the genetic data available allowed us to investigate structure at a good resolution. However, a more accurate investigation of events that took place on a shorter time scale remains to be carried out once genomic sequence data, free of ascertainment bias, make rare-variant data available.