Introduction

Genotype imputation is a method for statistically inferring untyped genotypes in a sample of partially genotyped individuals, based on a reference panel of individuals who have been more densely genotyped or sequenced. Imputation methods attempt to identify haplotype sharing between individuals in the sample and in an imputation reference panel (IRP), and use this information to infer the alleles at untyped loci in the sample.1 Imputation allows geneticists to study variants that have not been directly genotyped in a sample and thereby to increase power and resolution of genome-wide association studies (GWAS). Imputation is particularly useful for combining association results across studies that used different genotyping arrays2 and facilitates fine-mapping to localise association signals by considering all genetic variants in a region.

Publicly available IRPs from the International HapMap Project3, 4 and 1000 Genomes Project (1000G)5 have been instrumental to the discovery of thousands of loci affecting diseases and traits in individual GWAS and collaborative meta-analyses. The first wave of studies mostly used the HapMap II IRP, which used microarray-based genotypes from 270 individuals at 3.1 million (M) variants.6, 7, 8, 9, 10 Later studies used IRPs based on the 1000G project, which performed whole-genome sequencing (WGS) on a diverse set of populations, with 2504 individuals and up to 84.4 M variants.11, 12, 13, 14, 15, 16 Although the latter IRP allows robust imputation of common variants (minor allele frequency (MAF)≥5%) and low-frequency variants (0.5≤MAF<5%),5 it has only limited imputation accuracy for rare (MAF<0.5%) variants.17, 18, 19 A recent IRP from the Haplotype Reference Consortium (HRC)20 contains even more individuals (N=32 488, mostly with European ancestry) and should therefore enable better imputation of both low-frequency and rare variants in European samples.

Recently, several studies have demonstrated that the use of population-specific IRPs can further improve the imputation accuracy of common and low-frequency variants, and improve the imputation of rarer variants in the relevant population.21, 22, 23, 24 By using an IRP composed of related Dutch individuals, Deelen et al.23 showed that it is possible to substantially improve the completeness and accuracy of imputation of rare variants into a set of Dutch individuals. Gudbjartsson et al. used long-range haplotype phasing in combination with imputation to increase imputation accuracy for rare variants down to MAF of 0.1% in the Icelandic population.22 Sidore et al. reported several variants associated with circulating lipid levels in Sardinians that were detected due to accurate imputation achieved by using a Sardinian WGS-based IRP; these authors showed that the variants would not have been identified if the analyses had been based on the 1000G IRP.24 Similar results were obtained in the UK10K project, where the British population-specific IRP combined with the 1000G Project reference panel facilitated the discovery of several novel genetic variants associated with medically relevant phenotypes.19, 25, 26

Studies have shown that the genetic structure of European countries correlates closely with their geographic origin.27, 28 The Estonian population, being located in Northeast Europe, is genetically most similar to its neighbouring countries, including Finland, the North-western part of Russia, and other Baltic countries.28, 29, 30 Notwithstanding this overall genetic similarity, the Estonian population still has a substantial proportion of haplotypes that are not expected to be covered by the more diverse IRPs. Moreover, the population-specific differences are expected to increase as allele frequencies decrease.

In the current study, we first evaluated the two most commonly used phasing algorithms to create a population-specific IRP based on high-coverage (30 ×) WGS data from 2244 Estonian individuals. To impute low-frequency and rare variants more accurately in a specific population, one can take two approaches: (i) increase the size of IRPs from diverse populations to capture more reference haplotypes or (ii) employ population-specific IRPs. We assessed the utility of these approaches for improving imputation in Estonian samples by comparing the performance of (i) an Estonian-specific IRP, (ii) the commonly used 1000G IRP, (iii) the much larger HRC IRP and (iv) combinations of these panels.

Materials and methods

Cohort description

A total of 2304 geographically distributed individuals (selected randomly by county of birth) from the Estonian Biobank of the Estonian Genome Center, University of Tartu (EGCUT) were selected for WGS. EGCUT is a population-based biobank, containing almost 52 000 samples of the adult population (aged ≥18 years), which closely reflects the age, sex and geographical distribution of the Estonian population. A total of 6394 individuals (selected randomly and not overlapping with the WGS data set) from the Estonian Biobank were selected for genotyping using the Illumina HumanCoreExome (Illumina, San Diego, CA, USA) array, and a subset of 505 of these individuals also underwent whole-exome sequencing (WES).

WGS and WES sequencing and variant calling

WGS samples followed a PCR-free sample preparation. Libraries were sequenced on the Illumina HiSeq X Ten (Illumina, San Diego, CA, USA) with the use of 150 bp paired-end reads to 30 × mean coverage with a median insert size of 400 bp±25%. For WES samples, DNA was enriched for target sequences (Agilent Technologies, Santa Clara, CA, USA; Human All Exon V5+UTRs) according to the manufacturer’s recommendations.

Sequenced reads were aligned to the GRCh37/hg19 human reference genome using BWA-MEM31 v0.7.7. SAMtools32 v1.2 was applied to compress SAM to BAM (samtools view), sort (samtools sort) and index BAM (samtools index) files. PCR duplicates were then marked using Picard (http://broadinstitute.github.io/picard) v1.136 MarkDuplicates.jar. For further BAM improvements, including realignment around known indels and base quality score recalibration, we applied Genome Analysis Toolkit (GATK)33, 34 v3.4 (v3.4-46). Single-sample genotypes were called by GATK HaplotypeCaller algorithm (-ERC GVCF). All gVCF-files were combined (-T CombineGVCFs) and jointly called (-T GenotypeGVCFs).

Quality control

Out of the total 2304 WGS samples submitted for sequencing, 4 samples did not have enough input DNA (<1.2 μg), 7 samples failed in library preparation three times and 9 samples had a contamination rate >10%. Thus, variants of 2284 WGS samples were jointly called. The GATK Variant Quality Score Recalibration was used to filter variants with a truth sensitivity of 99.8%. Also, variants with GATK inbreeding coefficient less than −0.3 were filtered to remove sites with excess heterozygous individuals. Only PASS sites were considered in the further analysis.

The PLINK/SEQ (https://atgu.mgh.harvard.edu/plinkseq) v0.10 i-stats module was used to calculate, per sample, the number of variants (NVAR), number of non-reference (NALT) variants, number of heterozygous (NHET) variants, NHET/NALT ratio and transition/transversion (TITV) ratio. Outlier samples (more than 3 SD below or above the population mean) were removed. In addition, genotype and phenotype sex concordance was checked for each sample and discordant samples were removed. The final WGS sample set contained 2244 individuals. The final WES sample set, which passed all quality control filters and was genotyped with the Illumina HumanCoreExome array, contained 505 individuals.
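The per-sample outlier rule described above can be illustrated with a minimal Python sketch. It is not the PLINK/SEQ implementation: the function name, the use of the population SD and the example TITV values are all assumptions made for illustration.

```python
from statistics import mean, pstdev

def flag_outliers(values, n_sd=3.0):
    """Return indices of samples whose metric lies more than n_sd SDs
    from the mean of all samples (population SD is an assumption here)."""
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        return []  # all samples identical, nothing to flag
    return [i for i, v in enumerate(values) if abs(v - mu) > n_sd * sd]

# Hypothetical per-sample TITV ratios: 19 typical samples and one aberrant one.
titv = [2.04, 2.06] * 9 + [2.05] + [1.50]
outliers = flag_outliers(titv)
print(outliers)
```

In practice the same rule would be applied independently to each metric (NVAR, NALT, NHET, NHET/NALT and TITV), and a sample flagged on any of them would be dropped.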

Multi-allelic SNVs were removed and we further excluded variants with call rate <0.95, minor allele count ≤2, Hardy–Weinberg equilibrium test P-value<1 × 10−6 and variants in low-complexity regions.35
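As a rough illustration of the variant-level thresholds above, the following Python sketch applies the call rate, minor allele count and Hardy–Weinberg filters to genotype counts for a single biallelic variant. The text does not state which HWE test was used, so the 1-d.f. chi-square test here is an assumption, as are the function names and the input layout.

```python
import math

def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """P-value of a 1-d.f. chi-square test for Hardy-Weinberg equilibrium
    (an assumed stand-in; the source does not name the test it used)."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic: the test is undefined, treat as passing
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 d.f.
    return math.erfc(math.sqrt(chi2 / 2.0))

def passes_variant_qc(n_aa, n_ab, n_bb, n_missing,
                      min_call_rate=0.95, max_excluded_mac=2, hwe_p=1e-6):
    """Apply the thresholds quoted in the text to one variant."""
    n_called = n_aa + n_ab + n_bb
    total = n_called + n_missing
    if total == 0 or n_called / total < min_call_rate:
        return False
    mac = min(2 * n_aa + n_ab, 2 * n_bb + n_ab)  # minor allele count
    if mac <= max_excluded_mac:
        return False
    return hwe_chi2_pvalue(n_aa, n_ab, n_bb) >= hwe_p
```

For example, `passes_variant_qc(2198, 2, 0, 44)` fails on the minor allele count, whereas a variant with genotype counts in exact Hardy–Weinberg proportions and 2% missingness passes.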

Genotype array data were filtered sample-wise by excluding samples on the basis of call rate (<98%), extreme heterozygosity (more than 3 SD from the mean), genotype and phenotype sex discordance, cryptic relatedness (IBD>20%) and deviation from European descent on the MDS plot in comparison with HapMap reference samples. SNP quality filtering included call rate (<99%), MAF (<1%) and extreme deviation from Hardy–Weinberg equilibrium (P-value<1 × 10−4). Non-autosomal SNPs were excluded from the analysis.

Haplotype phasing

The EGCUT WGS data were phased with SHAPEIT2(ref. 36) (r837), using four computer cores. Pre-phasing of the genotype array data was performed in a similar manner using SHAPEIT2 with four cores. As a separate test for pre-phasing accuracy, we used the chromosome 20 sequence of 2244 full genomes, which were filtered beforehand to exclude any non-founder family members and individuals with a genome-wide PI_HAT value above 0.5 when compared to other individuals in the data set (2195 individuals remained). To assess the efficiency of various approaches to phasing of WGS data, we applied two different tools: SHAPEIT2(ref. 36) and Eagle2.37, 38 Both programs were run with the default parameters and a varying number of cores (1, 2, 4, 8, 16, 24 and 32). To verify the phasing accuracy for other data sets, the 1000G data were phased using a similar pipeline (1, 8 and 32 cores).

In addition to the regular phasing functionality, the read-aware phasing capability of SHAPEIT2(ref. 39) was also assessed. The first step entailed creating a phase informative read file on the basis of BAM files, using the module ExtractPIRs v1 (r68) with default parameters provided by the authors. After the generation of phase informative reads, the obtained file could be used in a similar fashion to a map file as a reference point for SHAPEIT2 to phase the data sets. Phasing was performed in three parallel runs after which the average run time and accuracy were compared as indicators of phasing quality.

Phasing accuracy was defined as the number of switch errors present in the phased data set. For this, the phased founder genotypes were compared with the non-phased genotypes of their offspring to determine the heredity pattern of heterozygous positions; any shifts in heredity from one parental haplotype to another were counted as switches. Two families with one offspring and two families with two offspring were used to estimate the switch error rate in the EGCUT sample set, and four families with one offspring were used for the 1000G sample set. The switch error rate was calculated by dividing the number of haplotype switches by the number of heterozygous positions where the occurrence of a switch can be reliably determined, after which the results were averaged across the trios.
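The switch error calculation can be sketched in Python as follows. This is a simplified illustration, not the authors' pipeline: it assumes that the transmitted ("truth") haplotype has already been derived from the offspring genotypes, that only heterozygous sites are passed in, and that the denominator is the number of consecutive informative site pairs. All names are hypothetical.

```python
def switch_error_rate(inferred_hap1, truth_hap1, informative):
    """Switch error rate for one individual.

    inferred_hap1 / truth_hap1: allele (0 or 1) carried by the first
    haplotype at each heterozygous site, in genomic order.
    informative: True where family data determine the phase reliably.
    """
    # Orientation is True when the inferred phase agrees with the truth.
    orientation = [
        inf == tru
        for inf, tru, ok in zip(inferred_hap1, truth_hap1, informative)
        if ok
    ]
    if len(orientation) < 2:
        return None  # no opportunity to observe a switch
    switches = sum(a != b for a, b in zip(orientation, orientation[1:]))
    return switches / (len(orientation) - 1)

def mean_switch_error_rate(rates):
    """Average the per-trio rates, skipping trios without informative sites."""
    usable = [r for r in rates if r is not None]
    return sum(usable) / len(usable) if usable else None

# Toy example: the inferred phase flips once, after the second site.
inferred = [0, 1, 1, 0, 1, 0]
truth = [0, 1, 0, 1, 0, 1]
rate = switch_error_rate(inferred, truth, [True] * 6)
print(rate)
```

Counting changes in orientation rather than raw mismatches means that a single phasing flip is counted once, even though every site downstream of it disagrees with the truth.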

Genotype imputation

Imputation using the EGCUT and 1000G reference panels, separately and in combination, was performed at the High Performance Computing Center, University of Tartu using IMPUTE2 with default parameters. As IMPUTE2 allows two phased reference panels to be used in combination (the ‘imputation with two phased reference panels’ option), we also used the EGCUT and 1000G reference panels together (EGCUT+1000G and 1000G+EGCUT). When panels are combined in this way, IMPUTE2 imputes genotypes only for variants that are present in the first (main) panel but, in the process, uses additional haplotype information from the second panel to improve the imputation accuracy through a larger set of reference haplotypes.40

Imputation with the HRC panel was carried out using IMPUTE2 with default parameters except that the k_hap parameter was set to 1000.

For all imputation panels, monomorphic SNVs were excluded. No further filtering was performed based on IMPUTE2 info score, but most of the analyses rest on well-imputed (INFO>0.4) and confidently imputed (INFO>0.8) SNVs.

Post-imputation filtering and concordance analyses

The GATK GenotypeConcordance tool was used to calculate imputation accuracy (concordance, non-reference sensitivity and non-reference discordancy) for different imputation panels with WES data for overlapping individuals (N=505) used as the gold standard. Low-complexity regions were filtered out of WES data prior to analysis. PLINK v1.9 was used to convert IMPUTE2 files (imputation output) to VCF format using hard-call threshold 0.9. BCFtools filter option was used to keep genotypes imputed with INFO-value>0.4 and overlapping with WES-target regions. Comparison was performed in three MAF bins (MAF≥5%, 0.5≤MAF<5% and MAF<0.5%) based on WES minor allele frequencies and only well-imputed (INFO>0.4) SNVs were considered. Reference sequence in the concordance analyses was the same for both WGS and WES analysis pipelines.

To assess more stratified imputation accuracy, an additional concordance analysis was run for IRPs for well-imputed (INFO>0.4) variants in WES-based MAF bins of (0, 0.2), (0.2, 0.4), (0.4, 0.6), (0.6, 0.8), (0.8, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 10), (10, 20), (20, 30), (30, 40) and (40, 50%).
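The two binning schemes can be written down compactly. The Python sketch below is illustrative only: the function names are hypothetical, and treating each fine bin as left-open and right-closed is an assumption, since the text does not say how variants on a bin boundary were assigned.

```python
from bisect import bisect_left

def maf_class(maf):
    """Three-way classification used in the main analyses (MAF as a fraction)."""
    if not 0.0 <= maf <= 0.5:
        raise ValueError("MAF must lie between 0 and 0.5")
    if maf >= 0.05:
        return "common"
    if maf >= 0.005:
        return "low-frequency"
    return "rare"

# Bin edges in percent, as listed in the text.
FINE_EDGES = [0, 0.2, 0.4, 0.6, 0.8, 1, 2, 3, 4, 5, 10, 20, 30, 40, 50]

def fine_maf_bin(maf_percent):
    """Return the (lower, upper] bin, in percent, containing maf_percent."""
    if not 0 < maf_percent <= 50:
        raise ValueError("MAF (%) must be in (0, 50]")
    i = bisect_left(FINE_EDGES, maf_percent, 1)  # index of the upper edge
    return (FINE_EDGES[i - 1], FINE_EDGES[i])
```

Passing the MAF in percent to `fine_maf_bin` avoids floating-point drift from converting fractions such as 0.002 to percentages before comparing against the bin edges.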

Functional annotation of variants

We used Variant Effect Predictor41 version 84 to annotate the confidently imputed variants in the 20 345 protein-coding genes in the Ensembl database (with Gencode v19 on GRCh37).

Results

Using high-coverage WGS data of 2244 Estonian individuals from the Estonian Biobank,42 we created a population-specific IRP. After variant calling and rigorous quality control steps (Materials and Methods), we phased the Estonian WGS data and used the resulting Estonian IRP (referred to here as the EGCUT IRP, for the Estonian Genome Center at University of Tartu), together with the 1000G and HRC IRP, to impute genotypes into 6394 Estonians who had been genotyped on microarrays.

Phasing speed and accuracy of multi-threaded haplotype phasing

Haplotype phasing can be a time-consuming process, especially for large WGS-based data sets. We therefore began by evaluating haplotype-phasing algorithms. We compared three different parallel, multi-threaded computational programmes (SHAPEIT2,36 SHAPEIT2-RA (for read-aware)39 and Eagle2(refs. 37, 38)), each run with different numbers of processor cores (1, 2, 4, 8, 16, 24 and 32) (Supplementary Figure 1A). These programmes were applied to data from chromosome 20 in the EGCUT samples. Accuracy was assessed by counting the number of haplotype switch errors (Materials and Methods) in four families, for which haplotype phase could be independently determined based on segregation of genetic markers.

While the speed of both SHAPEIT2 and SHAPEIT2-RA increased in proportion to the number of cores used, the speed of Eagle2 increased proportionally up to eight cores but not beyond. Up to this point, Eagle2 was roughly sixfold faster than SHAPEIT2. The two versions of SHAPEIT2 showed similar accuracy, which was slightly lower for Eagle2 (average haplotype switch error rate of 0.7% with SHAPEIT2 vs 0.81% with Eagle2; Table 1). In all cases, the accuracy did not vary significantly with the number of cores used. To validate that these results were not population-specific, we performed similar analyses with four 1000G family trios (with 1, 8 and 32 cores) and observed similar switch error rates in the corresponding phasing results (Supplementary Table 1). While in our hands SHAPEIT2 displayed slightly higher accuracy, it did so at the cost of increased computing time, making Eagle2 a viable option for researchers who require time-efficient phasing of large data sets. However, because the 1000G and HRC IRPs were phased with SHAPEIT, we used this program to phase the EGCUT data (Materials and Methods).

Table 1 Phasing speed and accuracy to phase chromosome 20 of the EGCUT WGS data

Genotype imputation

To impute genotypes into 6394 Estonian individuals who had been genotyped on Illumina HumanCoreExome microarrays, we used the IMPUTE2 software43, 44 together with three separate IRPs and two combinations of IRPs (Table 2). The first IRP consisted of the 2244 whole-genome sequenced EGCUT individuals; these individuals were selected to be geographically distributed across Estonia and did not overlap with the set of genotyped individuals. The other two were the 1000G IRP and the HRC IRP from large diverse populations. The IMPUTE2 software can also improve imputation accuracy by using two reference panels simultaneously, pooling haplotype information across both IRPs.40 We used both combinations of the EGCUT and 1000G panels with that option: EGCUT+1000G and 1000G+EGCUT. In such combinations, IMPUTE2 imputes genotypes only for variants that are present in the first (main) IRP while also considering haplotype information from the second IRP to improve the imputation accuracy through a larger set of reference haplotypes. Thus, EGCUT+1000G should be viewed as an improvement of the EGCUT reference panel (genotypes observed in the EGCUT panel imputed while considering haplotypes inferred from the EGCUT and 1000G panels) and 1000G+EGCUT should be considered as an improvement of the 1000G panel (genotypes observed in the 1000G panel imputed while considering haplotypes inferred from both panels).

Table 2 Description of compared IRPs

Number of imputed variants

For each IRP, we studied the number of imputed single-nucleotide variants (SNVs) as a function of the imputation confidence estimate—INFO-value—assigned by the IMPUTE2 programme. The INFO-value reflects the information in imputed genotypes relative to the information if only the allele frequency were known.43, 44 We counted the total number of imputed SNVs, the number of ‘well-imputed’ SNVs (INFO>0.4)18 and the number of ‘confidently imputed’ SNVs (INFO>0.8). We also counted the number of imputed SNVs found only with each IRP (Figure 1a).
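To make the INFO-value concrete, the Python sketch below computes one commonly cited form of the IMPUTE-style information measure from posterior genotype probabilities and then applies the two thresholds used in this study. The formula is stated here as an assumption about the general form of the metric; the exact computation inside IMPUTE2 is not given in this text, and the function names are hypothetical.

```python
def impute_info(probs):
    """Information measure for one variant.

    probs: list of (p_AA, p_AB, p_BB) posterior probabilities, one tuple
    per individual, with B as the counted allele.
    Assumed form: 1 - sum(f_i - e_i^2) / (2 * N * theta * (1 - theta)),
    where e_i is the expected allele dosage and f_i its second moment.
    """
    n = len(probs)
    e = [p_ab + 2.0 * p_bb for _, p_ab, p_bb in probs]
    f = [p_ab + 4.0 * p_bb for _, p_ab, p_bb in probs]
    theta = sum(e) / (2.0 * n)  # estimated allele frequency
    if theta <= 0.0 or theta >= 1.0:
        return 1.0  # monomorphic by convention
    variance = sum(fi - ei * ei for fi, ei in zip(f, e))
    return 1.0 - variance / (2.0 * n * theta * (1.0 - theta))

def info_category(info):
    """Labels used in this study; 'confidently' implies 'well' imputed."""
    if info > 0.8:
        return "confidently imputed"
    if info > 0.4:
        return "well-imputed"
    return "low confidence"

certain = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]   # genotypes known exactly
uninformative = [(0.25, 0.5, 0.25)] * 4        # only allele frequency known
print(impute_info(certain), impute_info(uninformative))
```

The two toy inputs mark the ends of the scale: when every genotype is known with certainty the measure equals 1, and when each individual's probabilities are just the Hardy–Weinberg proportions implied by the allele frequency it equals 0.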

Figure 1

Number of variants imputed from different IRPs. (a) Number of all shared and panel-specific variants in three distinct reference panels imputed with INFO-value >0.4 (in bold) and >0.8 (given in brackets); (b) Total number of imputed SNVs (bars); the number of SNVs imputed with imputation quality score (INFO-value)>0.4 (coloured) and INFO>0.8 (shaded areas).

Although the number of total variants and well-imputed variants obtained with the larger diverse panels (1000G and HRC) exceeded the corresponding numbers for the population-specific panel, the situation was reversed for confidently imputed SNVs with 12.29 M (75% of total number of imputed SNVs), 10.05 M (48%) and 9.44 M (27%) of SNVs being confidently imputed with the EGCUT, HRC and 1000G panel, respectively (Figure 1b). The combined EGCUT+1000G panel showed almost identical results to the EGCUT panel alone, whereas the 1000G+EGCUT panel showed a considerable increase in the number of confidently imputed SNVs (by considering additional haplotype information from the population-specific IRP) as compared to the 1000G panel alone. These results indicate that using a population-specific IRP increases the number of confidently imputed variants, due to more similar allele frequencies and greater relatedness between the samples and the IRP. More diverse IRPs have a tendency to employ an incorrect allele frequency distribution and also to contain divergent haplotypes, which are not present in the samples (e.g., African haplotypes carrying variants that are not polymorphic in non-African populations).

We next stratified these analyses according to the MAFs of the imputed SNVs, dividing them into three groups: common (MAF≥5%), low-frequency (0.5≤MAF<5%) and rare (MAF<0.5%) SNVs. For common variants, the number of imputed SNVs was very similar across the IRPs (Figure 2). For low-frequency variants, the number of well-imputed SNVs was also very similar, whereas the number of confidently imputed SNVs was larger for the population-specific IRP. For rare variants, the results were even more pronounced: 3.48 M (54% of well-imputed rare variants), 2.54 M (33%) and 1.86 M (15%) SNVs were imputed confidently from the EGCUT, HRC and 1000G panels, respectively (Figure 2b, Supplementary Table 2). Notably, the EGCUT panel outperformed the other panels on rare variants despite the fact that the HRC panel contains the largest number of haplotypes (64 976) and the 1000G panel contains the largest number of variants (81 M SNVs on autosomes).

Figure 2

Number of common (MAF≥5%), low-frequency (0.5≤MAF<5%) and rare (MAF<0.5%) variants imputed from different IRPs. (a) Number of well-imputed SNVs (imputed with imputation confidence INFO>0.4); and (b) number of confidently imputed SNVs (imputed with imputation confidence INFO>0.8).

These results show that imputation confidence (measured as INFO-value) decreases substantially as the allele frequency of the imputed variants declines (Supplementary Figure 2). Despite the fact that the larger and more diverse IRPs contained more variants, they contained fewer matching haplotypes than the population-specific panel. As a result, the HRC and 1000G panels yielded genotypes imputed with lower confidence (INFO-value), especially for rare SNVs (Supplementary Figure 3). For the combinations of reference panels, the EGCUT+1000G showed almost identical results in every aspect compared to EGCUT panel alone, while the 1000G+EGCUT panel showed a slight gain for common and low-frequency variants and a substantial gain for rare variants when compared to 1000G panel alone (Figure 2).

Imputation of loss-of-function and missense variants

Loss-of-function (LoF) variants that disrupt protein-coding genes and missense variants that cause amino acid changes are of particular interest because they are potentially clinically relevant. Considering only confidently imputed SNVs (INFO>0.8), we observed that all three reference panels enabled imputation of a similar number of common LoF and missense variants (Figure 3). However, the number of low-frequency LoF variants was higher with the population-specific IRP and the number of rare LoFs was almost twice as high (417, 439 and 730 LoF SNVs with the 1000G, HRC and EGCUT, respectively; Supplementary Table 3) with the population-specific IRP.

Figure 3

Number of common (MAF≥5%), low-frequency (0.5≤MAF<5%) and rare (MAF<0.5%) LoF (a) and missense (b) variants imputed from different IRPs with INFO-value>0.4 (bars) and INFO-value>0.8 (shaded areas).

Imputation sensitivity and accuracy

Although imputation confidence estimates (such as INFO-values or squared correlations r2)45, 46 are useful for characterising the overall success of the imputation process, high INFO or r2 values do not guarantee that the corresponding genotypes are inferred correctly. Therefore, it is important to directly assess the accuracy of the imputed genotypes. We compared the ‘best guess’ genotypes imputed from the different reference panels to WES data available for a subset of imputed EGCUT individuals (N=505; Supplementary Figure 1B). Treating these WES-based genotype calls as the ‘gold standard’, we calculated two metrics for each imputed data set: (i) sensitivity, defined as the proportion of WES-based non-reference (NR) variant calls that were also obtained through the imputation process; and (ii) discordancy rate, defined as the proportion of imputed SNVs that had an incorrect genotype call.
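The two metrics can be illustrated with a short Python sketch. It follows the definitions given in the text rather than reproducing the GATK GenotypeConcordance tool: genotypes are coded as alternate-allele counts (0, 1, 2, or None for missing), a missing imputed genotype is counted as not recovered, and pairs in which both calls are homozygous reference are left out of the discordancy denominator. These choices and all names are assumptions.

```python
def nr_metrics(truth, imputed):
    """Return (NR sensitivity, NR discordancy) for paired genotype calls.

    truth: gold-standard genotypes (e.g. WES) as alt-allele counts or None.
    imputed: best-guess imputed genotypes in the same order and coding.
    """
    truth_nr = recovered = compared = discordant = 0
    for t, g in zip(truth, imputed):
        if t is None:
            continue  # no gold-standard call to compare against
        if t > 0:
            truth_nr += 1
            if g is not None and g > 0:
                recovered += 1  # imputation also reports a non-reference call
        if g is None or (t == 0 and g == 0):
            continue  # skip missing calls and hom-ref/hom-ref agreement
        compared += 1
        if t != g:
            discordant += 1
    sensitivity = recovered / truth_nr if truth_nr else None
    discordancy = discordant / compared if compared else None
    return sensitivity, discordancy

# Toy data for eight genotype pairs.
wes = [0, 1, 2, 1, 0, 1, None, 2]
imp = [0, 1, 2, 0, 1, 2, 1, None]
sens, disc = nr_metrics(wes, imp)
print(sens, disc)
```

Excluding hom-ref/hom-ref pairs keeps the discordancy rate from being diluted by the large number of trivially concordant reference calls, which matters most for rare variants.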

For well-imputed common SNVs, all of the IRPs gave similarly high sensitivity (88.5–92.4%) (Figure 4a). For low-frequency SNVs, the three panels that included data from the population-specific panel (EGCUT, EGCUT+1000G, and 1000G+EGCUT) yielded higher sensitivity (~87%) than the more diverse panels (78% and 76% for HRC and 1000G, respectively) (Table 3). For rare SNVs, the proportional difference was even greater (40%, 42% and 49% for 1000G, HRC and EGCUT IRPs, respectively).

Figure 4

Imputation accuracy for common (MAF≥5%), low-frequency (0.5≤MAF<5%) and rare (MAF<0.5%) well-imputed variants (INFO>0.4) imputed from different IRPs. (a) Non-reference (NR) sensitivity—proportion of whole-exome sequencing (WES) based NR variant calls that were also retrieved through imputation process. (b) NR discordancy rate—proportion of NR variants that were retrieved through imputation process but had incorrect genotype calls as compared to the WES genotypes.

Table 3 Genotype concordance of well-imputed SNVs (INFO>0.4)

Similarly, the population-specific IRP performed better with respect to discordancy rate (Figure 4b). Whereas all three panels had a low discordancy rate for common variants (1.9–3.4%), the EGCUT panel outperformed other panels for low-frequency and rare SNVs (Table 3). Notably, one-quarter (24.7%) of rare SNVs imputed from the 1000G IRP had incorrect genotype calls, whereas the proportion was substantially lower with the EGCUT IRP alone (14.1%) or if it was used in combination with the 1000G panel (13.6% and 14.3% for the EGCUT+1000G and 1000G+EGCUT panels, respectively). Similar results were seen for confidently imputed variants, for which both sensitivity and discordancy rate were better in case of the population-specific reference panel (Supplementary Figure 4, Supplementary Table 4). The better performance is due to a close match between the EGCUT IRP and Estonian samples—owing to the fact that rare variants tend to be more recent and thus more population specific.

We repeated these analyses of imputation accuracy by using finer bins of MAF (Supplementary Figures 5–9). We found that although the overall success of genotype imputation of well-imputed variants decreased steadily with MAF in case of all compared IRPs, imputation accuracy was, especially for rare variants, significantly better in case of the population-specific IRP (Supplementary Figure 7) or if it was used together with the 1000G reference panel (Supplementary Figures 8 and 9).

Discussion

Genotype imputation is a cost-efficient way to improve the power and resolution of GWA studies. Although large IRPs from diverse populations work reasonably well for imputation of common and low-frequency variants, currently available reference panels allow only limited imputation of rare variants.

WGS has become increasingly widespread in recent years and is increasingly used in creating IRPs. The first step in the process of creating an IRP is the correct assignment of alleles at polymorphic positions to the individual haplotypes. Although the task can be computationally demanding for large data sets, the advent of various phasing algorithms has simplified this task considerably. We compared the performance of the SHAPEIT2 and Eagle2 software, both of which can increase the phasing speed by dividing the phased reference data set into multiple subsets, which are then processed in parallel. In line with a previously published comparison,38 we found that Eagle2 was considerably faster than SHAPEIT2. However, the decrease in phasing time resulted in a small increase in haplotype switch errors, making SHAPEIT2 a better choice for those aiming at the highest accuracy. Interestingly, we did not observe a difference in phasing accuracy between SHAPEIT2 and SHAPEIT2’s read-aware mode. It is possible that this was due to the relatively homogeneous nature of our Estonian samples and that the SHAPEIT2 read-aware mode may exhibit advantages for more heterogeneous data sets.

Consistent with previous studies, our results show that population-specific IRPs can improve the genotype imputation, especially for low-frequency and rare variants.21, 22, 23, 24 By being genuinely reflective of the study data set, population-specific IRPs can therefore facilitate discovery of true associations in GWAS and subsequent fine-mapping of causal variants, as demonstrated by others24, 47, 48 and also with the Estonian population-specific reference panel.49

Although the large IRPs from more diverse populations led to the imputation of a larger number of rare SNVs, a large proportion of these genotypes were imputed with low imputation confidence (IMPUTE2 INFO-value). Focusing only on confidently imputed SNVs, the population-specific IRP outperformed the 1000G and HRC IRPs. Although the overall imputation success and accuracy depend on several different factors (including the size of the IRP and the genetic structure of the reference panel and the genotyped sample), these observations are expected to apply to other populations with similar genetic background.

Beyond imputation quality, we also considered sensitivity and discordancy rate of the imputed genotypes. We found that the population-specific IRP outperformed the large IRPs from diverse populations, a finding that is also in line with other recent imputation accuracy comparisons.50 Using a large IRP that is not well matched in terms of ancestry can thus not only limit the discovery of associations in GWAS as observed previously24 but also introduce variants that are not actually polymorphic in the imputed sample.50

Because short insertion-deletion (indel) variants were not part of the HRC IRP and because calling indel variants is still more error-prone than SNV calling, we did not include indels in our IRP and our comparisons. Once technical limitations related to indel calling and phasing are resolved, indels should be included in all IRPs.

In conclusion, we observe that, although currently publicly accessible large diverse IRPs like 1000G and HRC enable imputation of many low-frequency and rare variants in the Estonian population, most of these variants are imputed with relatively low confidence and, furthermore, there is a significant proportion of population-specific variation that cannot be imputed from these panels. Moreover, imputation of low-frequency and rare variants is considerably more accurate with a population-specific reference panel or if one is used in combination with a publicly available reference such as the 1000G panel. Our results also suggest that, given that the population-specific reference panel size (number of haplotypes) is comparable to the 1000G panel size, the previous observation that reference sample size is more important than precise population matching does not apply equally well to all populations, and population-specific panels can outperform reference panels that are an order of magnitude larger but more diverse.