Introduction

Population isolates have not only provided insights into population diversity and history, but also offer an exciting opportunity to identify rare and low-frequency variants associated with complex diseases.1, 2, 3, 4 Whether looking across the whole genome or focusing on genetic variation in the coding regions, these studies have consistently observed the highest enrichment in variation that is predicted to disrupt protein-coding genes.

Within coding regions, variant alleles that have high penetrance whilst predisposing to disease are likely to be deleterious and therefore kept at low frequencies by purifying selection in larger outbred populations.5, 6, 7 Isolated populations resulting from recent bottlenecks have a substantial reduction in rare neutral variation and also many functional and even deleterious variants present at relatively higher frequencies because of increased drift and reduced selective pressure. Hence, recent isolates can be used to study causal variants that are rare in other populations in association with complex diseases.1, 2, 3, 4

Finland is a well-known example of an isolated population where multiple historical bottlenecks resulting from consecutive founder effects have shaped the gene pool of current-day Finns.8 Previous studies suggest that the latest historical migration into Finland occurred ~4000 years ago.9 Owing to the lack of evidence of major migratory movements, it has been suggested that the migrating groups of people were small but significant.

Settlements resulting from the latter migratory movements mainly occurred along the south-east coast of Finland. Further, due to geopolitical reasons there were additional major migratory movements within Finland in the 16th century in the eastern and northern parts of Finland. These settlements, initially founded by a small number of people, have grown in size over time, leading to secondary population bottlenecks. An extreme example of the latter is Kuusamo, a county in the northeast part of Finland.10, 11 Historical records show that in 1718, there were 165 houses consisting of 615 individuals belonging to 39 families. Rapid population growth leading to a present-day population of >15 000 individuals further increased the allelic drift in this sub-isolate. These historical events have led to reduced genetic variation and higher overall linkage disequilibrium levels in Finland as compared with outbred populations.10, 11

During the last 1000 years, the Finnish population size has grown more than two orders of magnitude – from around 50 000 individuals to more than 5 million individuals. Furthermore, the most rapid growth has happened during the last 10 generations (~250 years), with population size growing from 500 000 to 5.4 million individuals. Combined with the historical bottleneck effect, these events have caused a massive departure from population genetic equilibrium whilst ‘shifting’ the proportion and frequency of many initially rare variants.

Such deviations have led to an increase in the prevalence of some monogenic Mendelian disorders in Finland as compared with other parts of the world; these disorders are referred to as the Finnish disease heritage12 (FDH). Pronounced effects of the bottleneck have also been observed for complex diseases and disorders. For instance, schizophrenia is almost three times as prevalent in the northeastern sub-isolates as in the rest of Finland.13 Similarly, protective effects of enriched variants have also been observed, as exemplified by variants in the LPA gene that protect against risk of cardiovascular diseases.1

However, the dynamics and properties of this genetic 'enrichment' are poorly understood on the genome scale, particularly outside the protein coding regions. We set out to provide a more comprehensive view of this enrichment and the other bottleneck effects in Finland by comparing whole-genome sequencing data in Finnish and British samples. In this study, we show how the historical bottlenecks have affected the genetic landscape of Finns and the frequency profile of variants across the entire genome. Whole-genome sequencing data gave us a unique opportunity to determine the enrichment of variants across both coding as well as non-coding regions of the human genome.

Methods

Sample selection

We sequenced the whole genomes of 1463 unrelated Finns at low coverage (~4.6 ×). These samples belonged to the FINRISK14 and H2000 cohorts. The FINRISK study, carried out every 5 years, comprises samples of the working-age population and is designed to study the risk factors associated with chronic diseases across Finland. The H2000 is a population-based national survey aimed at studying the prevalence and determinants of important health problems amongst the working-age and the aged population (http://www.terveys2000.fi/julkaisut/baseline.pdf). Amongst these, 856 individuals have low HDL and 691 individuals have been diagnosed with psychosis. Further, 371 individuals belong to a sub-isolate within Finland, the Kuusamo region. Due to known genetic differences, for the comparison between Britons and Finns, we restricted the analyses to those Finnish individuals who are not from Kuusamo. To study the effects of a bottleneck within a bottleneck, 371 samples from non-Kuusamo Finns were further used for comparison against 371 samples from Kuusamo. All study participants gave their written informed consent to the study of origin.

Whole-genome sequencing and variant discovery

Low read-depth whole-genome sequencing was performed at the Wellcome Trust Sanger Institute (WTSI). Joint variant calling of the raw binary sequence alignment map (BAM) files along with the UK10K samples was performed as part of the Haplotype Reference Consortium (HRC).15 The genotypes were further refined by re-phasing using the SHAPEIT3 algorithm.16 As a part of the joint calling quality control, only those sites that have a minor allele count of at least 5 copies in the entire data set (32 611 samples) went through additional filtering. Hence we have restricted the analyses to those variants with minor allele count ≥5. The BAM files have been submitted to the European Genome-phenome Archive (EGA). To minimize batch effects, we performed these analyses on only those British samples from UK10K (1463 of 3781 samples) that were also sequenced at the WTSI. We have only included autosomal single-nucleotide variants for these analyses.

To determine the quality of the data, we compared the Finnish whole-genome sequencing data with Illumina PsychArray genotypes for 629 individuals. We performed a two-step quality control for the chip genotype data. The calls were first made using GenCall. We excluded duplicate samples and samples with a gender mismatch. Additional quality control steps were performed based on zCall data. All the samples with call rate <98% and heterozygosity >3 s.d. were removed. Further, we performed SNP-wise QC to exclude variants with call rates <95% and Hardy–Weinberg P-value <10⁻⁶.

The filtered chip data were used for concordance analyses of the low-pass whole-genome Finnish sequencing data using the GATK GenotypeConcordance module. From this comparison, we estimate that for variant sites with minor allele frequency (MAF) >5% the non-reference sensitivity is 99.1% and the non-reference discrepancy is <0.1%. For variant sites with MAF between 2 and 5%, we observe a non-reference sensitivity of 97.3% and a non-reference discrepancy of 4.3%. For variant sites with MAF between 0.5 and 2%, the non-reference sensitivity is 93.9% and the non-reference discrepancy is 11.9%. Below MAF 0.5%, the number of variants was too low to calculate the genotype concordance.

Annotations

The various functional categories were obtained as follows:

(a) Coding sequence, promoter and untranslated region annotations were obtained from the UCSC Genome Browser17 using the Gencode v19 gene models.18

(b) Coding variants were further stratified using the Variant Effect Predictor19 into loss-of-function variants, missense variants and synonymous variants. Polyphen20 predictions were used to classify missense damaging variants.

(c) DNase I hypersensitivity sites (DHS) were obtained from Trynka et al.21 We merged the coordinates for all cell types into one category.

(d) Conserved regions in mammals were obtained from Lindblad-Toh et al.22 These were post-processed by Ward & Kellis.23

(e) FANTOM5 enhancer coordinates were obtained from Andersson et al.24

(f) Super-enhancers were obtained from Hnisz et al.25 The genomic coordinates were merged over all cell types.

(g) Transcription factor binding sites were obtained from the ENCODE project.26

Enrichment analysis

We calculated the enrichment for each category beyond the baseline enrichment observed (enrichment calculated using all variants in Finns and Britons), assuming the following model.

Consider a category of variants in which we have observed F variants in the first population (eg, Finnish) and B variants in the second population (eg, British) and let M=F+B. Let s be the proportion of variants from the first population, and u the ratio of the numbers of variants in the first population to that in the second population. According to the binomial distribution, our point estimate for s is ŝ=F/M and has variance approximately ŝ(1−ŝ)/M. It follows that a point estimate for u is û=F/B, and the variance of log(û) is 1/(Mŝ(1−ŝ)) by the Delta method. This allows us to estimate 95% confidence intervals for û.

Suppose that we are comparing two categories of variants, and have observed F1 and B1 variants in category 1 and F2 and B2 variants in category 2. To test whether u1 is different from u2, we compute log(û1)−log(û2). Under the null hypothesis of no difference, this statistic has mean 0 and variance approximately (1/M1+1/M2)/(ŝ(1−ŝ)), where ŝ=(F1+F2)/(M1+M2), which we use to derive a P-value. Note that the standard proportion test between s1 and s2 gives essentially the same P-value.
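The enrichment model described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the code used in the study; the function names and the variant counts in the usage lines are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for quantiles and tail areas


def enrichment_ci(F, B, level=0.95):
    """Point estimate and confidence interval for u = F/B.

    F and B are the numbers of variants observed in the first and second
    population; the interval is built on the log scale (delta method).
    """
    M = F + B
    s_hat = F / M                                   # proportion from population 1
    u_hat = F / B                                   # ratio of variant counts
    var_log_u = 1.0 / (M * s_hat * (1.0 - s_hat))   # variance of log(u_hat)
    z = _STD_NORMAL.inv_cdf(0.5 + level / 2.0)
    half_width = z * math.sqrt(var_log_u)
    return u_hat, u_hat * math.exp(-half_width), u_hat * math.exp(half_width)


def compare_categories(F1, B1, F2, B2):
    """Two-sided test of u1 == u2 using log(u1_hat) - log(u2_hat)."""
    M1, M2 = F1 + B1, F2 + B2
    s_pooled = (F1 + F2) / (M1 + M2)                # pooled estimate under the null
    stat = math.log(F1 / B1) - math.log(F2 / B2)
    var = (1.0 / M1 + 1.0 / M2) / (s_pooled * (1.0 - s_pooled))
    z = stat / math.sqrt(var)
    p_value = 2.0 * _STD_NORMAL.cdf(-abs(z))
    return z, p_value


# Hypothetical counts, for illustration only (not taken from Table 1).
u_hat, ci_low, ci_high = enrichment_ci(F=120, B=100)
z, p = compare_categories(F1=130, B1=100, F2=1000, B2=1000)
```

Working on the log scale keeps the confidence interval for û strictly positive, which is why the interval is exponentiated at the end rather than computed directly on the ratio.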

We calculated the statistical power gained for the enriched variants. For quantitative traits, the test statistic of the standard linear model for genotype–phenotype association follows a chi-squared distribution with one degree of freedom and non-centrality parameter (NCP) of 2Nf(1−f)b², where N is the sample size, f is MAF and b is the (additive) effect size of the minor allele measured on the scale of the phenotype. For case–control analysis, the corresponding NCP is 2Nf(1−f)r(1−r)b², where N is the total sample size (cases+controls), r is the proportion of cases among all samples and b is the additive effect of the minor allele on the log-odds of the disease.27 Both NCPs are derived assuming that the variant explains only a little of the phenotypic variation at the population level, which is a reasonable assumption when the minor allele is rare and/or the allelic effect size is small.

Thus, for both quantitative and binary traits, the sample size N2 required in population 2 for the same power to detect an association as in population 1 is N2=N1 f1(1−f1)/(f2(1−f2)) assuming equal effect size and case proportion across the studies.
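The sample-size relation above follows from equating the two NCPs and can be checked numerically. The sketch below is illustrative only; the function name, the British MAF of 0.1% and the reference sample size of 10 000 are assumptions chosen for the example.

```python
def required_sample_size(n1, f1, f2):
    """Sample size needed in population 2 for the same power as n1 in population 1.

    Implements N2 = N1 * f1(1 - f1) / (f2(1 - f2)), assuming equal effect
    size and (for binary traits) equal case proportion in both populations.
    """
    for f in (f1, f2):
        if not 0.0 < f < 1.0:
            raise ValueError("allele frequencies must lie strictly between 0 and 1")
    return n1 * f1 * (1.0 - f1) / (f2 * (1.0 - f2))


# Assumed example: a variant at MAF 0.1% in Britons, studied in 10 000 Britons.
n_britons = 10_000
maf_britons = 0.001
for fold in (2, 5, 10):
    n_finns = required_sample_size(n_britons, maf_britons, maf_britons * fold)
    # Fractions come out at roughly 0.50, 0.20 and 0.10 of the British sample size.
    print(f"{fold}-fold enrichment: {n_finns / n_britons:.3f} of the British sample size")
```

For rare variants the factor (1−f) is close to 1 in both populations, so the required sample size scales almost exactly with the inverse of the fold enrichment.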

Results

Overall frequency distribution of genome-wide level variation

As a part of the SISu project, we sequenced the genomes of 1463 Finnish samples at low read-depth (average 4.6 ×) sampled across Finland. We compared these profiles to a sample of 1463 British individuals sequenced at average depth 7 × as a part of the UK10K Consortium.28 We restricted the analyses to 1463 individuals to minimize the artefacts arising from comparing data from different sequencing centers. Further, to reduce potential batch effects, these data sets were jointly processed as part of the Haplotype Reference Consortium.15 After stringent quality control steps, we compared the MAFs of 10 457 802 and 11 172 232 single-nucleotide variants (SNVs) identified with minor allele count 5 or greater in 1463 Finns and in the same number of Britons, respectively (Table 1).

Table 1 Summary of SNVs studied in Finnish and British samples

As a direct result of the bottleneck effect, we observed that Finns have significantly fewer rare variants (MAF<0.5%) compared with Britons (Figure 1). On the other hand, in Finns, we observed a proportionally small but significant enrichment of low-frequency variants (MAF range between 2 and 5%, binomial P<2.2 × 10⁻¹⁶). The latter is also a direct effect of the historical bottleneck, followed by population growth. As expected, we observed no differences in the number of common (MAF>5%) variants (Figure 1).

Figure 1

(a) Allele frequency spectrum of variants across the whole genome in Finns compared with the Britons. The black line represents the ratio of the number of variants observed in Finns to those in Britons. (b) The number of variants seen in each population across the genome in different MAF bins. The lines in blue and red represent the number of variants for each bin observed in Finns and Britons, respectively.

For each frequency range, we also calculated the percentage of variants shared between both population samples. As anticipated, the number of variants observed as rare in Britons (MAF<0.5%) and also found as polymorphic in Finns was considerably lower than the opposite: only 54.7% of variants with MAF<0.5% in Britons were polymorphic in Finns while 72% of variants with MAF<0.5% in Finns were also polymorphic in Britons (Figure 2). However, for the MAF range of 0.5–5% the opposite was true: a lower proportion of variants seen in Finns were also polymorphic in Britons (eg, for 0.5–2% range, 84.9 and 94.3% of variants are shared, respectively). For common variants (MAF>5%), essentially all (99.9%) were observed to be shared in both directions (Table 1 and Figure 2).

Figure 2

Variants shared between the two populations. The percentage of variants that are shared between the Finns and the Britons across different allele frequency bins. The histograms represent the allele frequencies of the shared variants in the other population for the MAF bin 2–5%.

Enrichment of variants across functional categories

We also calculated the relative enrichment of Finnish SNVs across various functional categories shown to be relevant in different phenotypic traits including disease.29 For each of these categories, we compared its distribution profile with that of the ‘expected’ whole genome baseline distribution in Finns (Figure 1a). Although there were several small deviations from the expected baseline in almost all functional categories, the greatest differences were consistently observed in the MAF range of 2–5% (Supplementary Figures 1–8).

In accordance with the latter observation, we compared the enrichment of different functional categories for MAF range 2–5% (Figure 3a). Across the studied functional categories, the coding regions showed the highest enrichment in Finns (Figure 3a). More specifically, we observed >1.3-fold enrichment of loss-of-function (P=0.0291) variants and >1.1-fold enrichment of missense (P=0.0197) variants (Figure 3b), as was demonstrated previously in Finns by exome sequencing.1 Furthermore, we observed consistent enrichment of rare and low-frequency (MAF≤5%) missense damaging variants (Figure 3b).

Figure 3

Enrichment of variants across various categories. (a) Forest plot showing the enrichment across various functional categories for the variants in the minor allele frequency range 2–5%, where we observe consistent enrichment across most categories. The sizes of the boxes correspond to the size of each category and the black horizontal lines represent the 95% confidence intervals. Proportional enrichment is calculated compared with Britons. (b) Proportional enrichment of LoFs in Finns compared with Britons. The red line represents the ratio of the number of LoF variants in Finns compared to Britons. The black line shows the baseline enrichment observed across the whole genome. (c) Proportional enrichment of the number of variants in the conserved regions in Finns compared with Britons. The red line represents the variants common between conserved regions and the coding regions. The blue line represents the variants in the conserved regions but not in the coding regions. The black line shows the baseline enrichment observed across the whole genome.

As observed for the low-frequency variants (MAF 2–5%) in the coding regions, we found enrichment in the non-coding regions as well. In the non-coding parts of the genome, the promoter regions showed the largest enrichment compared with the expected baseline (P=0.012, Supplementary Figure 3), followed by the conserved non-coding regions of the human genome (P=0.01, Figure 3c). Although the Fantom5 enhancer regions showed proportional enrichment, it was not significant compared with the expected baseline (Figure 3a; Supplementary Figure 4). The other functional categories followed the baseline enrichment (Figure 3a). We also observed that, although enriched when compared with the Britons, the DHS and the super-enhancer elements are only marginally depleted beyond the expected bottleneck effects (PDHS=0.04 and Psuper-enhancers=0.007, Supplementary Figures 5 and 7).

MAF-enrichment of variants and effect on statistical power

We observed that 20.16% of all variants in Finns have minor allele frequencies elevated at least twofold. Furthermore, 1.36% of these variants were enriched ≥50 fold. For the proportionally enriched functional categories, we calculated the number of variants with elevated frequencies in Finns as compared to Britons (Table 2) and observed even higher MAF-enrichment for many of these categories. Missense damaging variants showed the highest enrichment with 37.98% variants showing minor allele frequencies at least twice as high as observed in the Britons. 29.71% of the loss-of-function variants showed at least twofold MAF-enrichment compared with the British sample.

Table 2 MAF-enrichment of variants in Finns and Britons

We also performed the same analyses in Britons. For variants that are enriched at most 10-fold, the Britons consistently show a much higher number of variants across all enriched functional categories. However, in the loss-of-function and missense damaging categories beyond 10-fold enrichment, the Finns have a larger proportion of variants enriched. Interestingly, beyond 50-fold enrichment, the Finns have a relatively larger proportion of variants enriched across all functional categories (Table 2).

We extended these analyses to compare the enrichment in known GWAS loci30 and ClinVar variants.31 Similar to the above results, the Britons have a higher proportion of GWAS loci for variants that are enriched <50-fold (Table 2). However, the Finns have a larger proportion of known associated loci for variants enriched more than 50-fold. Further, in the Finnish sequencing data, we observed 16 variants associated with FDH. Most of these were enriched at least fivefold. As a specific example, a variant in the AGA gene (c.488G>C; MAF in Finns=0.0096; MAF in Britons=0) is enriched 28-fold in Finns and is associated with aspartylglucosaminuria (OMIM #208400).

This enrichment of minor allele frequencies for certain variants boosts the statistical power to detect possible associations with traits and diseases. To quantify this gain in power, we calculated the number of samples required to detect association with high probability in Finns as compared with the Britons (Figure 4a). For variants that are twice as common in Finns as in Britons, only half the number of individuals would be required to detect the associations in Finnish samples. Further, for variants with minor allele frequencies enriched 5-fold and 10-fold, only 20 and 10% of the samples, respectively, are required to detect associations. These analyses indicate the gain in power for association analyses in isolated populations such as the Finnish population.

Figure 4

Statistical power gained due to enrichment of a variant in Finns. (a) Plot showing the number of Finnish samples required to detect association if a variant is enriched twofold (blue), fivefold (purple) and 10-fold (yellow) in Finns as compared with Britons. (b) Regression coefficient (beta) desired to achieve a statistical power of 80% at genome-wide significance level as a function of minor allele frequency for a quantitative trait for variants enriched 10-fold in Finns. The red line indicates the betas in Britons and the blue line indicates the betas in Finns. (c) Odds ratio desired to achieve a statistical power of 80% at genome-wide significance level as a function of minor allele frequency for a case–control analysis for variants enriched 10-fold in Finns. The red line indicates the odds ratio in Britons and the blue line indicates the odds ratio in Finns.

To elucidate the power gained for variants enriched 10-fold in Finns, we have simulated an additive genetic model for both quantitative trait association and case–control association analyses (Figures 4b and c). For variants with MAF 0.1% in Britons, Finns have 80% statistical power to detect associations at genome-wide significance (α=5 × 10⁻⁸) with regression coefficients (‘beta’) of ~1 s.d. (Figure 4b). Similarly, for the case–control scenario, Finns have 80% statistical power to detect association with odds ratio of ~2.5 with 5000 cases and 5000 controls (Figure 4c).

The increase in statistical power can be further exemplified by the missense variant PCSK9-R46L that is known to be associated with low-density lipoprotein (rs11591147; MAF in Finns=0.03862; MAF in Britons=0.02016; β=−0.47). This variant is enriched 1.92 times in Finns. For this variant, we achieve 80% statistical power to detect an association at genome-wide levels of significance with 2415 Finns. However, with the same sample size in the Britons, we have only 19% power to detect the association. Similarly, for the splice site variant in the LPA gene (c.4974-2A>G), which is a protective variant against coronary heart disease (MAF in Finns=0.03213; MAF in Britons=0.003076; OR=0.84), 36 200 cases and 50 000 controls are required to achieve 80% power at genome-wide significance level in Finns. Using the same number of cases and controls in the Britons, there is 0.05% power to detect the association. Furthermore, the gain in statistical power can help to detect enriched genetic variants with modest effects associated with diseases that are present at a higher prevalence in Finland, as exemplified by a variant located in the intron of the RADIL gene (c.536-18508 T>A) and associated with intracranial aneurysms (rs150927513; MAF in Finns=0.0591; MAF in Britons=0.0021; RR=1.59).32
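The PCSK9 and LPA power figures above can be approximated from the NCP formulas given in the Methods. The following is a minimal sketch rather than the study's own calculation: it assumes that β is expressed in s.d. units of a phenotype with unit variance, and it uses the fact that a one-degree-of-freedom non-central chi-squared variable is the square of a normal variable with mean √NCP. All function names are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def ncp_quantitative(n, f, b):
    """NCP = 2 N f (1 - f) b^2 for a quantitative trait (b assumed in s.d. units)."""
    return 2.0 * n * f * (1.0 - f) * b ** 2


def ncp_case_control(n_cases, n_controls, f, odds_ratio):
    """NCP = 2 N f (1 - f) r (1 - r) b^2, with b the log-odds ratio."""
    n = n_cases + n_controls
    r = n_cases / n                      # proportion of cases
    b = math.log(odds_ratio)
    return 2.0 * n * f * (1.0 - f) * r * (1.0 - r) * b ** 2


def power_1df(ncp, alpha=5e-8):
    """Power of a 1-df chi-squared test with the given NCP at level alpha."""
    z_crit = -_STD_NORMAL.inv_cdf(alpha / 2.0)   # two-sided critical value
    shift = math.sqrt(ncp)
    return _STD_NORMAL.cdf(shift - z_crit) + _STD_NORMAL.cdf(-shift - z_crit)


# PCSK9-R46L, quantitative trait, 2415 individuals in each population.
pcsk9_finns = power_1df(ncp_quantitative(2415, 0.03862, -0.47))
pcsk9_britons = power_1df(ncp_quantitative(2415, 0.02016, -0.47))

# LPA splice variant, 36 200 cases and 50 000 controls, Finnish MAF.
lpa_finns = power_1df(ncp_case_control(36_200, 50_000, 0.03213, 0.84))

print(f"PCSK9: Finns {pcsk9_finns:.2f}, Britons {pcsk9_britons:.2f}")
print(f"LPA:   Finns {lpa_finns:.2f}")
```

Under these assumptions the PCSK9 values come out close to the 80% and 19% quoted in the text, and the LPA value in Finns is close to 80%. Very small power values, such as the British LPA figure, are sensitive to the exact approximation used, so they may not be reproduced exactly by this sketch.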

Sub-isolate of an isolated population

Amongst the sequenced Finnish individuals, 371 belonged to the Kuusamo sub-isolate within Finland. When comparing the SNV frequency profiles of these individuals against the same number of randomly selected non-Kuusamo Finns, we observed a significant reduction in the number of rare variants (Supplementary Figure 9). Although there was no overall enrichment of low-frequency variants, when looking at the variants stratified by their functional categories, we found a significant enrichment of LoF variants in the MAF 0.5–2% frequency range (P=0.0272; Supplementary Figure 10).

Discussion

Studying relatively recently bottlenecked and isolated populations or sub-isolates provides an excellent opportunity to discover disease-associated genes, as some of the underlying (and initially rare) variants can reach much higher frequency after the population bottleneck. We studied this bottleneck effect and subsequent enrichment of variants in Finnish samples by comparing them to outbred British samples. We demonstrated how the historical bottlenecks have affected the genetic landscape of Finns and the frequency profile of variants across the entire genome.

As expected, we observed no major differences in the common variant frequency spectrum – as most variants with MAF>5% probably segregated already tens of thousands of years ago, they are known to be relatively equally distributed in populations that separated more recently.33, 34 On the other hand, there was a significant depletion of variants in the rare frequency spectrum in Finns. Also, as an additional hallmark of the population bottleneck, a significant enrichment of low-frequency variants was observed (Figure 1). For most functional variants we observe an enrichment beyond the expected baseline, showing that bottlenecked populations have a higher likelihood of accumulating deleterious and disease-associated mutations. To test the robustness of the enrichment of low-frequency variants, we changed the minor allele frequency bins for the whole-genome analysis. We observed consistency in the enrichment of low-frequency variants (MAF 1–5%; Supplementary Figure 11). This phenomenon also explains the high prevalence of several monogenic Mendelian disorders, the so-called ‘FDH’, caused by genetic disease variants found at much higher frequencies in Finland than in the rest of Europe.12

We observed that within the frequency range of MAF 0.5–2%, only a subset (84.9%) of the variants in Finnish samples is also seen in the British samples (Figure 2). For the common variants, in contrast, most variants (99.9%) were shared between the two populations (Figure 2). These findings are similar to the patterns observed in the Icelandic3 and the Sardinian populations.2 Finns also show a similar enrichment of LoF variants and missense variants as seen in the Icelandic populations. However, the enrichment observed in the Icelandic population was found in the lower minor allele frequency range as opposed to the Finnish sample, possibly due to the differences in the historical bottleneck 'width', time since bottleneck (the Icelandic bottleneck was more recent than the Finnish), and the subsequent population growth rate. As such, this enrichment can provide a boost in statistical power when studying health-related phenotype traits affected by these enriched variants.

Other studies have recently demonstrated that functional categories such as conserved regions and Fantom5 enhancers contribute disproportionately more to the heritability of complex diseases, suggesting that in addition to coding regions also regulatory regions are enriched for trait and disease-associated variation.29, 35, 36 Here, we used 12 functional annotations to determine if variants in any of these categories are enriched beyond the baseline distribution of variants (and bottleneck effect) in Finns. We observed an enrichment across most functional categories in the low-frequency bin (MAF 2–5%). As reported previously,1 we observed a significant enrichment of low-frequency LoF and missense variants in Finns (Figure 3b). In addition to the enrichment of coding variants, however, also non-coding conserved regions and non-coding genic regions such as intron and promoter regions showed enrichment beyond the baseline bottleneck effect (Figure 3c). This enrichment likely appears due to selection against these variants in non-coding conserved regions and non-coding genic regions in outbred European populations. Furthermore, we see depletion for the super-enhancer regions and the DHS elements. This suggests that functionally, super-enhancers may be actually less active than regular enhancers, as was also proposed previously.29, 37

Previous studies have shown the utility of bottleneck populations to identify variants with elevated frequencies associated with diseases and phenotypes.1, 2, 13 Our findings show that across the genome, ~20% of all variants present in Finns have minor allele frequencies at least twice those observed in the Britons (Table 2). The percentage of variants with at least 2 × enrichment further increases for loss-of-function variants and missense damaging variants (29.71 and 37.98%, respectively). Our power calculation simulations show that when testing for associations with these variants, the number of samples required to detect a significant association is much lower (Figure 4a).

This power gain gives advantages particularly in (i) identifying rare variants with small/moderate effects, (ii) studying diseases that are not very common, for which large case–control collections cannot be assembled, and (iii) investigating quantitative phenotypes not measured in existing large biobanks. Examples of these include variants in the AGA, PCSK9, LPA and RADIL genes. Sequencing studies combined with imputation of these enriched variants in large-scale Finnish population-based cohorts with rich phenotype data, leveraging the national health registry data from Finland, will likely have great potential to help identify similar novel genetic associations for complex disorders.

Although we tried to eliminate all possible sources of biases and other technical limitations by jointly processing our data sets, our results might be somewhat limited in the very rare variant spectrum. FINRISK and Health2000 cohorts have collected samples from all over mainland Finland. In this study, however, the samples have been geographically randomly selected. As low-coverage whole-genome sequencing data are sub-optimal for detecting variants observed only in a few individuals, rare variants observed in Britons were likely to be called more confidently compared with similar variants in Finns. In addition, the British data set had slightly higher coverage than the Finnish data (7× vs 4.6×), which may have had some effect on calling of the rare and low-frequency variants in Finns. Such technical limitations and differences may have led to under-estimation of our main findings (except the depletion of rare variants in Finns). Our comparison was limited to British samples and autosomal SNVs only, and future studies should therefore carry out comparisons against a panel of jointly processed heterogeneous population samples, including all types of variants (also from sex chromosomes). When comparing the Kuusamo sub-isolate sample to the Finnish non-Kuusamo individuals, we found that only LoF variants (that also showed the largest enrichment between Finns and Britons) appear significantly enriched. This is possibly due to the small sample size of the Kuusamo subset.

This study provides insights into the effects of a population bottleneck in various functional categories across the whole human genome. Obvious advantages of isolated populations are significantly reduced heterogeneity in genetic architecture, phenotype and the environment. The frequency of an originally rare allele that passed through the population bottleneck can be increased by several orders of magnitude (even >100-fold for some variants), after which it will decline relatively slowly (due to selective pressure). This phenomenon will therefore increase the statistical power to identify rare variants associated with complex disorders in both coding as well as non-coding regions of the human genome in isolated populations.1, 13