Introduction

Genotype imputation is now common practice in genome-wide association (GWA) analysis1,2. Imputation facilitates meta-analyses of studies genotyped on different platforms3,4,5 and is expected to increase the power of GWA analyses6. It is also used in fine-mapping efforts7. Moreover, genome-wide DNA sequencing is still cost-intensive. Sequencing a part of the population and imputing the remaining individuals using the sequenced samples as reference is therefore a recommended strategy8.

Different reference panels of densely genotyped individuals are available and are used as templates of the haplotype structure for the target data sets9,10,11,12,13,14. For example, HapMap provides publicly available reference panels containing individuals with ancestry from West Africa, East Asia and Europe10,11. The latest generation of the HapMap reference panel10 is known as “HapMap3” and includes about 1.6 million common single nucleotide polymorphisms (SNPs) in 1,184 reference individuals from 11 populations. In addition, ten 100-kilobase regions were sequenced in a subset of these individuals. Another relevant reference panel is phase 3 of the 1000 Genomes project9,13. This dataset comprises a haplotype map of 80 million SNPs from 2,504 individuals derived from 27 populations. These reference panels are continuously improved in sample size, density and quality.

Although genotype imputation is a well-established technique, algorithms and methodological processes are continuously refined. To deal with large reference panels, new imputation frameworks and methods were developed for faster computation. Among these, imputation with pre-phasing of the target dataset is the most popular method currently in use. This strategy is implemented in the frameworks MaCH15 plus Minimac (MaCH-Minimac) and IMPUTE27 plus SHAPEIT16 (SHAPEIT-IMPUTE2)17. Research on pre-phasing claimed that it achieves accuracy comparable to imputation without pre-phasing17. In the present paper, we aim to verify this claim and to compare the two pre-phasing frameworks using the POPRES dataset. Moreover, these two frameworks are compared with those not relying on pre-phasing, namely MaCH, MaCH-Admix18 and IMPUTE2.

Another issue during imputation is how to deal with the continuously increasing amount of mixed ethnicities in large epidemiologic studies. This has raised the question to what extent genotype imputation accuracy is affected by reference panels that do not exactly match the ancestry of the target populations. To address this issue, imputation algorithms were further refined so that they can adopt reference panels with individuals from multiple populations. This is done by letting the software choose a “custom” reference panel either in a piecewise manner or for the whole genome. Utilizing recent releases of reference panels, different approaches for the selection of appropriate combined reference panels have been discussed: creating a cosmopolitan reference panel by selecting haplotypes from all of the available reference populations19,20,21,22, constructing a reference panel by weighted combination strategies23,24, by principal component clustering25, or by selection based on identity-by-state (IBS)18,20.

Several software packages are designed to deal with admixed populations. Here, we consider three of the most popular methods: IMPUTE2, SHAPEIT-IMPUTE2 and MaCH-Admix. All three programs implement an IBS-based strategy for selecting an appropriate reference panel. In contrast to IMPUTE2 and SHAPEIT-IMPUTE2, MaCH-Admix does this in a piecewise manner. We compare these three programs with the software frameworks requiring homogeneous populations as reference panel: MaCH and MaCH-Minimac. In summary, we compare a total of five imputation frameworks to assess how pre-phasing and usage of admixed reference panels affect imputation accuracy in a variety of populations (POPRES26). An extensive simulation study was performed for this purpose.

Since the sample size of the POPRES panel is small, we studied the dependence of our comparisons on sample size in a larger data set from a population-based study in Germany.

Materials and Methods

Datasets

We considered subsamples of different ethnic origins taken from a large set of Population Reference Samples (POPRES)26. We obtained the POPRES dataset from dbGaP27 through dbGaP accession number phs000145.v4.p2. Genome-wide genotyping of these individuals was performed on the Affymetrix (Mountain View, CA) GeneChip 500K Array set with the published protocol for 96-well-plate format. For our simulation study, we considered data of chromosome 22 consisting of 5,637 SNPs. As target sets for imputation, we selected a total of 20 populations for which at least 40 individuals were available. If more than 40 individuals were available, a random subset of N = 40 was selected. Among these populations, 15 were of Caucasian origin: Australian, Canadian, German, French, Swiss-French, Swiss-German, Swiss, Italian, Spanish, Irish, British, Belgian, Portuguese, individuals from former Yugoslavia, and a mixed group of east European origin (i.e. a mixture of people from the Czech Republic, Hungary and Poland). Two populations were of South Asian origin (Indians and Punjabis), one was East Asian (Japanese), one was Mexican, and one was a mixed population of African-Americans (AfAm). Since the POPRES subsets contained only small numbers of individuals, we also considered a larger German data set of 2,500 individuals from the LIFE-Adult study, a population-based study carried out in the city of Leipzig. The study design is described elsewhere28.

Quality Control and Masking of SNPs

The original POPRES data were based on the genomic assembly Affymetrix release 25 NSP25 and STY25 with dbSNP Build 126, released in May 2006. However, the HapMap3 reference panel contains rsIDs, and the corresponding Affymetrix IDs are annotated with dbSNP build 128. Therefore, it was necessary to match the annotation of the variant names and the strand orientation. Strand-matching was performed using “fcGENE”29. SNPs with ambiguous strand information were removed. A total of 1,014 SNPs could not be matched and were excluded, resulting in 4,623 SNPs eligible for analysis.

The main idea of our simulation study is to define high quality (HQ) SNPs that are assumed to express true genotypes. These SNPs are then masked, re-imputed and compared with the original genotypes to assess imputation accuracy. We aimed at masking a reasonable number of HQ SNPs for which imputation quality can be assessed without thinning out the linkage disequilibrium structure too much. Moreover, we preferred to mask common variants, which are more informative regarding comparisons of true and imputed genotypes. Therefore, we applied the following SNP filter to define HQ SNPs: call rate (CR) ≥ 95%, minor allele frequency (MAF) ≥ 0.1 and p-value of the Hardy-Weinberg equilibrium test, p(HWE), ≥ 0.01. For the latter, we applied an exact stratified test of HWE calculated over all POPRES populations considered30. Overall, 457 SNPs passed these quality criteria in all data subsets.
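The HQ filter described above can be sketched in a few lines of Python. This is only an illustration and not the pipeline used in the study: the function names and the genotype coding (0, 1 or 2 copies of one allele, None for a missing call) are assumptions, and a plain chi-square goodness-of-fit test with one degree of freedom stands in for the exact stratified HWE test30.

```python
import math

def snp_summary(genotypes):
    """Return (call_rate, maf, hwe_p) for one SNP.

    genotypes: list of 0/1/2 allele counts, None for a missing call
    (hypothetical coding chosen for this sketch).
    """
    called = [g for g in genotypes if g is not None]
    n = len(called)
    call_rate = n / len(genotypes)
    if n == 0:
        return call_rate, 0.0, 0.0
    p = sum(called) / (2 * n)              # frequency of the counted allele
    maf = min(p, 1 - p)
    if maf == 0:
        return call_rate, maf, 1.0         # monomorphic: HWE is not testable
    # Chi-square test against HWE proportions (simplification, not the exact test).
    expected = [n * (1 - p) ** 2, 2 * n * p * (1 - p), n * p ** 2]
    observed = [called.count(k) for k in (0, 1, 2)]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    hwe_p = math.erfc(math.sqrt(chi2 / 2))  # chi-square survival function, 1 df
    return call_rate, maf, hwe_p

def is_hq(genotypes, min_cr=0.95, min_maf=0.1, min_hwe_p=0.01):
    """Apply the three thresholds named in the text to one SNP."""
    cr, maf, hwe_p = snp_summary(genotypes)
    return cr >= min_cr and maf >= min_maf and hwe_p >= min_hwe_p
```

In the study, a SNP had to pass the criteria in all data subsets, so a filter of this kind would be applied per population and the results intersected.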

Imputation quality of a SNP depends on the number of missing SNPs (denoted as missingness here). To assess the impact of the degree of missingness, different percentages of HQ SNPs were masked, namely 50%, 70% or 100%. To ensure comparability, the masked sets are nested: SNPs masked in the 50% scenario are also masked in the 70% scenario, which in turn are masked in the 100% scenario.
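One simple way to obtain such nested masking sets is to draw a single random order of the HQ SNPs and take prefixes of increasing length. The following Python sketch illustrates the idea; the function name, the seed and the rounding rule are assumptions and not taken from the study.

```python
import random

def nested_masks(hq_snps, fractions=(0.5, 0.7, 1.0), seed=1):
    """Return {fraction: set of masked SNP ids} such that the sets are nested."""
    order = list(hq_snps)
    random.Random(seed).shuffle(order)   # one fixed random order for all scenarios
    masks = {}
    for f in sorted(fractions):
        k = round(f * len(order))
        masks[f] = set(order[:k])        # a prefix, so smaller scenarios are subsets
    return masks
```

Because every scenario is a prefix of the same permutation, differences between scenarios reflect the amount of missing information rather than a different choice of SNPs.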

To study the effect of sample size, we considered 2,500 samples from LIFE-Adult. Genotyping was performed using the Affymetrix Axiom CEU array. Affymetrix Power Tools with standard settings were used for primary SNP calling. Samples were excluded by the following criteria: dish QC < 0.82, call rate < 0.97, sex mismatch, implausible relatedness and PCA outliers (6 SD). SNPs were excluded by the following criteria: call rate < 0.97, Affymetrix cluster measures as recommended (FLD, HetSO and HomRO), number of minor alleles < 3, deviation from Hardy-Weinberg equilibrium (p < ), plate association (p < 10−7) and minor allele frequency.

For the analysis, we considered 2,474 SNPs in a 10-megabase region of chromosome 22. HQ SNPs were defined by MAF ≥ 0.2, p-value of the exact Hardy-Weinberg test ≥ 0.5 and call rate ≥ 0.995. A total of 522 SNPs fulfilled these criteria and were masked and re-imputed accordingly. To study the impact of sample size, we considered randomly chosen subsets of the original data set of sizes 2,500, 1,000, 500, 250, 100 and 40. Here, each larger subset always contains the smaller ones.

Reference Panel

In the HapMap project, genotyping was performed directly, while the 1000 Genomes dataset relies (at least partly) on low-depth whole-genome sequencing data. Therefore, HapMap still has the higher accuracy and was chosen as reference panel for the present study10,11. Here, we used the pre-formatted HapMap3 reference panel. Imputation with MaCH and MaCH-Minimac was performed using the reference panel that best matched the ancestry of the target population. This strategy was considered as the standard against which the results of MaCH-Admix, IMPUTE2 and SHAPEIT-IMPUTE2, the frameworks which adopt admixed reference panels, were compared. The reference panels CEU, YRI, MEX and JPT + CHB, provided by the MaCH software developers through their homepage31, were used for imputing the target data sets. The best-matched reference was selected by minimizing Nei’s GST, a measure of genetic differentiation, between the target population and the available reference panels as recommended elsewhere32.
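Reference selection by minimal Nei’s GST can be illustrated as follows. The sketch assumes biallelic SNPs, equal weighting of the two populations and no small-sample correction, which may differ from the estimator used in ref. 32; the function names and any allele frequencies passed in are hypothetical.

```python
def nei_gst(freqs_a, freqs_b):
    """Nei's G_ST between two populations from per-SNP allele frequencies.

    Assumptions of this sketch: biallelic SNPs, equal population weights,
    no correction for sample size.
    """
    hs_sum = ht_sum = 0.0
    for pa, pb in zip(freqs_a, freqs_b):
        p_bar = (pa + pb) / 2
        # Mean expected heterozygosity within the two populations.
        hs_sum += (2 * pa * (1 - pa) + 2 * pb * (1 - pb)) / 2
        # Expected heterozygosity of the pooled population.
        ht_sum += 2 * p_bar * (1 - p_bar)
    return (ht_sum - hs_sum) / ht_sum if ht_sum > 0 else 0.0

def best_matched_reference(target_freqs, reference_freqs):
    """Return the name of the reference panel with minimal G_ST to the target.

    reference_freqs: dict mapping a panel name (e.g. "CEU") to its allele
    frequencies at the same SNPs as target_freqs.
    """
    return min(reference_freqs,
               key=lambda name: nei_gst(target_freqs, reference_freqs[name]))
```

G_ST is 0 for identical allele frequencies and 1 when the two populations are fixed for different alleles, so smaller values indicate a better-matched panel.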

IMPUTE2 uses a mixed cosmopolitan reference panel collected from a variety of sampling locations in Africa, Asia, Europe and America. It automatically selects a ‘custom’ reference panel separately for each individual during imputation. We downloaded the mixed reference panel created from the samples of the HapMap3 project, available at the IMPUTE2 website33, and used it for our purposes. This mixed reference panel consists of haplotypes of a total of 1,011 individuals genotyped at 20,084 SNPs on chromosome 22. Since our aim was to compare IMPUTE2 and MaCH-Admix, we used the same mixed reference panel for both by converting the IMPUTE2 reference to MaCH-Admix format using fcGENE29.

Because the overlap of HapMap3 and the Axiom CEU array was rather small, we decided to impute our LIFE-Adult samples with the 1000 Genomes reference panel (Phase 1 Release V3)34,35.

Imputation

Imputation was performed separately for each data subset using five different imputation frameworks with or without pre-phasing or usage of admixed reference panels. Table 1 compares the frameworks regarding these options.

Table 1 Imputation Frameworks analysed: Frameworks differ with respect to usage of pre-phasing or admixed versus specific reference panels.

For imputation with MaCH, version 1.0.18.c, we first estimated imputation error rate and recombination rate in the haplotype panels by running the “greedy” algorithm for 30 iterations. These two model parameters were then used to determine the posterior probabilities of each genotype in the second step15. MaCH calculates the software specific measure “Rsq” to assess imputation quality15.

To perform imputation with MaCH-Minimac, we first determined the haplotypes of target data sets using MaCH software. Then the pre-phased data were imputed with Minimac, version Minimac2 from 2014.9.15.

For imputation with IMPUTE2, version 2.3.1 was used with default parameters. We performed imputation by splitting chromosome 22 into 6 chunks of equal size (5.711 Mb) as recommended33. This is done by providing the lower and upper base-pair boundaries with the IMPUTE2 command option “-int”. Format conversion and IMPUTE2 commands including the lower and upper boundaries of each chunk were generated by fcGENE29. The population-genetic model used by IMPUTE2 requires an effective population size as input parameter. Although different human populations have different effective sizes, the IMPUTE2 developers recommend a large value of about 20,000 for the parameter “-Ne” as a universal value with which they achieved high accuracy across all population groups. To avoid edge effects when chunking a genomic region, IMPUTE2 uses an internal buffer region (default 250 kb) on either side of the analysis interval33. Imputation jobs were run in parallel to reduce runtime. At the end of each computation, we extracted the imputation quality scores. As suggested by the software providers33, the best strategy for imputing genotype data with IMPUTE2 is first to phase the study population with SHAPEIT16,36 and then to impute the phased data with IMPUTE2. We followed this strategy (using SHAPEIT version v2 r790), which is denoted as “SHAPEIT-IMPUTE2” in the following.
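The chunking step can be illustrated with a short Python sketch that splits a region into contiguous intervals and formats the “-int” and “-Ne” options mentioned above. In the study this was done by fcGENE; the function names here are hypothetical, and the start and end coordinates must be supplied by the user because the exact boundaries used in the study are not stated.

```python
def chunk_intervals(start_bp, end_bp, n_chunks):
    """Split [start_bp, end_bp] into n_chunks contiguous, non-overlapping intervals."""
    span = end_bp - start_bp + 1
    bounds = [start_bp + (i * span) // n_chunks for i in range(n_chunks + 1)]
    return [(bounds[i], bounds[i + 1] - 1) for i in range(n_chunks)]

def int_options(start_bp, end_bp, n_chunks, ne=20000):
    """Format the interval and effective-population-size options for each chunk."""
    return ["-int %d %d -Ne %d" % (lo, hi, ne)
            for lo, hi in chunk_intervals(start_bp, end_bp, n_chunks)]
```

The intervals do not need to overlap, since the buffer region on either side of each interval is added by IMPUTE2 itself. Each option string would be appended to an otherwise complete IMPUTE2 command line, and the resulting jobs can be run in parallel.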

For imputation with MaCH-Admix18, version v2.0.203, we used the integrated default run mode, in which model parameters such as recombination and error rates are determined automatically before genotypes and imputation quality are calculated. Admixed reference panels used for MaCH-Admix were created from the corresponding IMPUTE2-formatted reference panels, which were downloaded from the home page of IMPUTE233. We also used the implemented two-step method of MaCH-Admix, which is similar to that of MaCH. Results were similar to those of the default strategy (not shown). All software-specific commands are provided in Supplementary Material S1.

Measures of imputation accuracy

Direct comparison of true and imputed genotypes: Although imputation software usually provides measures of imputation accuracy, these measures are typically software-specific, hampering comparisons across software. To circumvent this issue, we masked good-quality SNPs and re-imputed them, allowing an objective assessment of imputation accuracy. True genotypes and imputed genotype distributions were compared in the following ways. First, we compared the original true genotypes of masked HQ SNPs with the corresponding best-guess genotypes. For this type of comparison, we also analysed the posterior probabilities of both the correctly and the incorrectly imputed best-guess genotypes. In another approach, we compared true genotypes with estimated posterior distributions by applying the platform-independent Hellinger and SEN scores37. While the SEN score essentially compares the expectations of genotype distributions, the Hellinger score measures the agreement of genotype probabilities. A Hellinger score ≥ 0.45 ensures that the probability of the best-guess genotype is at least 0.49 and that the best-guess genotype matches the original genotype in almost all cases (see results below). Therefore, this cut-off was used to define well-imputed genotypes in the following.
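The relation between the Hellinger cut-off and the posterior probability can be made explicit with a small Python sketch. It assumes that the Hellinger score is defined as one minus the Hellinger distance between the imputed posterior distribution and a point mass on the true genotype; the exact definition is given in ref. 37 and is not restated here, so this form and the function names are assumptions of the example.

```python
import math

def hellinger_score(true_genotype, posterior):
    """1 - Hellinger distance between a point mass on the true genotype and the posterior.

    true_genotype: 0, 1 or 2; posterior: three genotype probabilities summing to 1.
    The definition is an assumption of this sketch (see the lead-in).
    """
    bc = math.sqrt(posterior[true_genotype])   # Bhattacharyya coefficient with a point mass
    return 1 - math.sqrt(max(0.0, 1 - bc))

def min_true_probability(cutoff):
    """Smallest posterior probability of the true genotype that reaches the cutoff."""
    return (1 - (1 - cutoff) ** 2) ** 2
```

Under this assumed definition, a score of at least 0.45 requires a posterior probability of the true genotype of at least (1 − 0.55²)² ≈ 0.487, which agrees with the value of 0.49 quoted above.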

To find out whether there are significant differences between the imputation scenarios, we formally compared percentages of well-imputed genotypes by McNemar’s test or raw quality measures by Wilcoxon signed rank test. Analyses were performed with the statistical software package R (www.r-project.org). We used 5% as significance threshold throughout all analyses, i.e. we refrained from correcting for multiple comparisons. Since we generally compared the best scenario against the others, we performed one-sided tests throughout. For these analyses, masked HQ SNPs were considered as independent in view of the relatively weak linkage structure of this subset. Only 1% of HQ SNP pairs showed a linkage disequilibrium of r2 ≥ 0.1.
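The paired comparison of well-imputed genotypes can be illustrated with an exact one-sided McNemar test. The analyses in the study were run in R, and it is not stated which variant of the test was used; the exact binomial version below is therefore a stand-in chosen for this sketch, and the function name and input format are hypothetical.

```python
from math import comb

def mcnemar_exact_one_sided(well_a, well_b):
    """One-sided exact McNemar test that framework A yields more well-imputed genotypes than B.

    well_a, well_b: paired booleans for the same genotypes under two frameworks.
    Returns (b, c, p_value), where b counts pairs well imputed by A only
    and c counts pairs well imputed by B only.
    """
    b = sum(1 for a, bb in zip(well_a, well_b) if a and not bb)
    c = sum(1 for a, bb in zip(well_a, well_b) if bb and not a)
    n = b + c
    if n == 0:
        return b, c, 1.0
    # Under H0 the discordant pairs split 50:50, so b ~ Binomial(n, 0.5).
    p_value = sum(comb(n, k) for k in range(b, n + 1)) / 2 ** n
    return b, c, p_value
```

Only discordant pairs enter the test, so genotypes that are well imputed (or poorly imputed) by both frameworks do not affect the p-value.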

Comparisons using software specific scores: Software specific imputation accuracy measures comprise MaCH-Rsq and IMPUTE-info scores. Both are defined on a SNP-wise rather than genotype level. Although these quality scores do not allow comparisons across software, they are often used to remove poorly imputed SNPs in practice. Hence, we consider these scores in a secondary analysis.
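To make the SNP-wise nature of these scores concrete, the following Python sketch computes the commonly cited form of MaCH-Rsq: the ratio of the empirical variance of the imputed allele dosages to the variance expected under Hardy-Weinberg equilibrium at the estimated allele frequency. The internal computation of the software may differ in detail, the function name is hypothetical, and the more involved IMPUTE-info score is not reproduced here.

```python
def mach_rsq(dosages):
    """Empirical dosage variance divided by the variance expected under HWE.

    dosages: imputed allele dosages (values between 0 and 2) of one SNP
    across all individuals.
    """
    n = len(dosages)
    mean = sum(dosages) / n
    p = mean / 2                                   # estimated allele frequency
    expected_var = 2 * p * (1 - p)
    if expected_var == 0:
        return 0.0
    observed_var = sum(d * d for d in dosages) / n - mean ** 2
    return observed_var / expected_var
```

The score uses only the imputed dosages and no true genotypes, which is why it can be computed for every imputed SNP but cannot by itself reveal confidently wrong imputations.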

Alternatively, one could calculate the correlation between imputed allele dosages and true genotypes separately for each SNP to assess its imputation quality. This measure is also software independent but does not account for random agreement due to the prior distribution of the imputed genotypes. Analysis shows that this measure is in strong agreement with MaCH-Rsq especially for larger sample sizes (Supplementary Figure S3).
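A minimal Python sketch of this SNP-wise correlation is given below. It computes the Pearson correlation between true genotypes and imputed dosages for one SNP; the function name is hypothetical, and whether the plain or the squared correlation was reported is not stated in the text.

```python
import math

def dosage_correlation(true_genotypes, dosages):
    """Pearson correlation between true genotypes (0/1/2) and imputed dosages for one SNP."""
    n = len(true_genotypes)
    mx = sum(true_genotypes) / n
    my = sum(dosages) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(true_genotypes, dosages))
    sxx = sum((x - mx) ** 2 for x in true_genotypes)
    syy = sum((y - my) ** 2 for y in dosages)
    if sxx == 0 or syy == 0:
        return float("nan")      # undefined when either variable is constant
    return sxy / math.sqrt(sxx * syy)
```

With only 40 individuals per population, such a correlation is estimated from few data points per SNP, which is in line with the better agreement with MaCH-Rsq reported for larger sample sizes.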

Results

Characteristics of quality scores for comparing different imputation frameworks

Initially, we characterized and compared our imputation accuracy scores (Hellinger score, SEN score and percentages of best-guess genotypes matching original genotypes) and the software-specific scores (MaCH-Rsq and IMPUTE-info). First, we aimed at identifying a cut-off for the Hellinger score to distinguish between correctly imputed genotypes (CIGs) and wrongly imputed genotypes (WIGs). Our analysis revealed that genotype distributions with Hellinger score ≥ 0.45 always had a posterior probability of the best-guess genotype greater than 0.49, and this was sufficient to match the original genotype in almost all cases (see Fig. 1 for the AfAm population; a representation as boxplots can be found in Supplementary Figure S1). This applies to all POPRES populations considered.

Figure 1: Violin plot of Hellinger scores of genotypes imputed with five different frameworks.

Results of the African-American (AfAm) population are shown. We present results for all imputed genotypes and, separately, for cases where best-guess genotypes match true genotypes (correctly imputed) or not (wrongly imputed). A Hellinger score ≥ 0.45 almost always ensured that the best-guess genotype matches the true genotype.

Since most of the research comparing software performance17,20 is based on the software-specific measures (MaCH-Rsq and IMPUTE-info), we studied these measures in relation to the Hellinger score. Figure 2 shows the results for four example populations of POPRES (German, AfAm, Indian and Japanese). Here, MaCH-Rsq and IMPUTE-info score are only roughly correlated with the Hellinger score. Interestingly, for a given value of the Hellinger score, SHAPEIT-IMPUTE2 showed clearly higher info scores than IMPUTE2. Since the Hellinger score is an objective measure of imputation accuracy, we conclude that the info measures of SHAPEIT-IMPUTE2 are inflated. The same trend was observed for MaCH-Minimac versus MaCH, but with much smaller magnitude.

Figure 2: Scatterplot of average Hellinger score against MaCH-Rsq/IMPUTE-info score for four different POPRES populations imputed with MaCH (using YRI reference panel), MaCH-Minimac (using YRI reference panel), MaCH-Admix, IMPUTE2 and SHAPEIT-IMPUTE2 (using admixed reference panels).

For the same Hellinger score, Info scores of SHAPEIT-IMPUTE2 are clearly inflated compared to IMPUTE2.

Of note, MaCH-Rsq and IMPUTE-info strongly depend on the underlying reference panel and can predict the imputation accuracy only under the assumption that the underlying reference panel is genetically very close to the target data set32. In contrast, Hellinger score is independent of software and makes no assumptions regarding the underlying reference panel. Therefore we decided to consider Hellinger score as the primary measure for imputation accuracy in this analysis.

To study the inflated accuracy scores of SHAPEIT-IMPUTE2 shown in Fig. 2 in more detail, we analyzed the probability of best-guess genotypes for each of the five frameworks. Results are shown in Fig. 3 (see also Supplementary Figure S2 for an alternative representation as boxplots). Interestingly, while the distributions of posterior probabilities of best-guess genotypes are similar across frameworks for correctly imputed genotypes (CIGs), the distribution of the SHAPEIT-IMPUTE2 values differs for wrongly imputed genotypes (WIGs). In contrast to the other frameworks, SHAPEIT-IMPUTE2 apparently estimates high posterior probabilities also for WIGs. In the sense of Fig. 3, MaCH-Admix shows the most desirable behavior, i.e. low probabilities for wrong best-guess genotypes.

Figure 3: Violin plot of posterior probabilities of best guess genotypes in AfAm population.

All imputation frameworks were used with default parameters and reference panels. SHAPEIT-IMPUTE2 shows considerably higher posterior probabilities for wrongly imputed SNPs.

Comparison of Frameworks using Admixed Reference Panels vs Best Matched Reference Panels

Next, we asked whether and under which circumstances the usage of admixed reference panels is advantageous compared to specific reference panels matched to the target population. More precisely, we analysed the impact of genetic similarity between reference and target population on imputation accuracy. For the imputation frameworks relying on a specific reference, we selected the reference with the smallest value of Nei’s GST as explained in the methods section. We used the percentage of genotypes with Hellinger score ≥ 0.45 as primary quality score. A total of 20 populations were analysed with all five imputation frameworks considered (Table 2).

Table 2 Comparison of percentages of genotypes with good Hellinger scores (≥ 0.45) obtained for 20 different POPRES samples with either MaCH, MaCH-Minimac, MaCH-Admix, IMPUTE2, or SHAPEIT-IMPUTE2.

When considering frameworks without pre-phasing (MaCH, MaCH-Admix, IMPUTE2), we found that usage of admixed reference panels (MaCH-Admix, IMPUTE2) was advantageous only if the genetic difference between target and reference population was large. In more detail, performance was better when Nei’s GST was close to or greater than about 0.01, which is the case in 6 of the 20 POPRES samples. For the POPRES population AfAm, for which no well-matched reference is available, MaCH is clearly outperformed by MaCH-Admix and IMPUTE2. In other words, for well-matched references and homogeneous populations, as in most of our POPRES samples, the usage of specific references results in superior imputation quality. Considering the pre-phasing frameworks (MaCH-Minimac and SHAPEIT-IMPUTE2), we found that both are clearly outperformed by their counterparts not relying on pre-phasing (MaCH and IMPUTE2, respectively). Results were similar when considering other measures of imputation quality such as SEN score, percentages of correctly imputed genotypes based on best-guess genotypes, and software-specific measures of imputation accuracy (Supplementary Tables S1, S2 and S3, respectively).

We observed a general trend of lower imputation qualities for larger genetic distances to the best matching reference. This also applies for imputation frameworks relying on mixed references (see Supplementary Figure S4).

Comparison of Frameworks Using Admixed Reference Panels

Table 3 shows the results of the comparison of imputation frameworks relying on admixed reference panels (MaCH-Admix, IMPUTE2 and SHAPEIT-IMPUTE2). For this purpose, we also consider three different missing scenarios to account for the impact of missingness on efficacy of the imputation frameworks. Again, we used McNemar’s test to compare the scenarios.

Table 3 Percentage of genotypes with good Hellinger score (≥ 0.45) for three imputation frameworks considering mixed reference panels.

MaCH-Admix and IMPUTE2 showed comparable performance. IMPUTE2 had an advantage over MaCH-Admix, especially for larger percentages of missingness, but the difference was generally not significant. In contrast, SHAPEIT-IMPUTE2 always showed significantly inferior results.

Results for the SEN score are similar (results not shown). We also determined the percentage of correctly imputed best-guess genotypes (Table 4). Results are similar to those of the Hellinger score, except that here one can observe a slight but non-significant advantage of MaCH-Admix over IMPUTE2. Hence, IMPUTE2 tends to be more confident at certain SNPs, while MaCH-Admix has a slightly higher average yield of correctly guessed genotypes. Again, SHAPEIT-IMPUTE2 showed significantly poorer performance than the other frameworks.

Table 4 Percentage of most likely genotypes which agree with the original genotypes for three imputation frameworks considering mixed reference panels:

Comparison of frameworks relying on pre-phasing

Table 5 shows results of the comparison of frameworks using pre-phasing (MaCH-Minimac and SHAPEIT-IMPUTE2). As primary quality measure, the percentage of genotypes with good Hellinger score (≥ 0.45) was used. As observed in Table 2, a small Nei’s GST between reference and target population was advantageous for MaCH-Minimac, which relies on specific reference panels. However, there was a trend for the difference to SHAPEIT-IMPUTE2 to become smaller as missingness increased. For populations whose genetic distance from the best-matching reference population is large, SHAPEIT-IMPUTE2 performed slightly better than MaCH-Minimac, although in many cases the difference was not significant.

Table 5 Percentage of genotypes with good Hellinger score (≥ 0.45) for imputation frameworks with pre-phasing strategy.

Similar results were obtained for the SEN score (see Supplementary Table S4). Again, we analysed the percentage of correctly guessed genotypes (Table 6). We found that MaCH-Minimac always performed better than SHAPEIT-IMPUTE2 except in the 70% and 100% missingness scenarios for the AfAm population. In these two scenarios, SHAPEIT-IMPUTE2 performed better, but the difference was not significant. This underlines the importance of admixed references for imputation of AfAm, for which no well-matching reference is available.

Table 6 Percentage of well-imputed best-guess genotypes for two imputation frameworks relying on pre-phasing.

Impact of sample size

The impact of sample size on the performance of imputation frameworks was studied in LIFE-Adult. Results are shown in Table 7. Again, methods without pre-phasing have higher accuracy than their counterparts relying on pre-phasing. But the difference becomes smaller with increasing sample size. MaCH is superior to IMPUTE2 for small datasets but for larger datasets, the opposite is true.

Table 7 Dependence of imputation accuracy on sample size studied in LIFE-Adult.

Discussion

In the present paper, we compared the imputation frameworks MaCH, IMPUTE2, MaCH-Admix, MaCH-Minimac and SHAPEIT-IMPUTE2 in a comprehensive simulation study of POPRES samples. We were interested in whether and under which circumstances pre-phasing or usage of admixed reference panels is advantageous.

Genotype imputation is nowadays common in genome-wide data analysis. Although frameworks such as MaCH, IMPUTE2 and Beagle are well established and generally result in good imputation quality, there are several attempts at further improvement. First, in order to deal with larger data sets, pre-phasing was established, which significantly accelerates imputation17. According to this strategy, the haplotypes underlying the target dataset are estimated first. Then, these haplotypes are used to estimate the genotypes. The two imputation frameworks SHAPEIT-IMPUTE2 and MaCH-Minimac adopt this concept17. While SHAPEIT-IMPUTE2 uses an admixed reference panel as input and lets the software choose a “custom” reference panel, MaCH-Minimac depends on a reference panel that is best matched with the target dataset. Second, admixed populations are becoming more and more frequent in genetic epidemiologic research. Therefore, frameworks accepting admixed reference populations were developed18,20. There is also some hope that admixed references might improve the imputation accuracy for populations for which no well-matching reference is at hand. IMPUTE2 and MaCH-Admix implement this approach. Both programs implement an IBS-based strategy for selecting the reference panel, but MaCH-Admix applies it in a piecewise manner. So far, only few published studies have compared the relative performance of the imputation concepts of pre-phasing or accounting for admixture17. Conclusions from these studies are limited since their findings were based only on the IMPUTE-info score as quality measure. According to our results (Fig. 2), the IMPUTE-info score strongly depends on the reference panel used. In our study, we used scores that allow a direct comparison of imputed and true genotypes. Using these measures, we compared the above-mentioned imputation frameworks in a comprehensive simulation study.

Our simulation study is based on the general idea of masking SNPs, re-imputing them and comparing the results using a variety of measures. Only good-quality SNPs were masked to ensure that expressed genotypes are correct with high certainty. As in earlier studies37, we considered the Hellinger score as the primary outcome of the comparison of masked and re-imputed genotypes. The score is maximal if and only if the two genotype distributions coincide. In our simulation study, we showed that a Hellinger score ≥ 0.45 almost always ensures that the best-guess genotype is correct. This applies to all software and simulation scenarios considered. We studied the SEN score and the percentage of correct best-guess genotypes as alternative objective measures of imputation quality. Results were in general similar to those of the Hellinger score.

Although software-specific measures of imputation quality such as MaCH-Rsq and IMPUTE-info are widely used to assess imputation accuracy, our results suggest that these measures should not be used as objective (absolute) measures of imputation accuracy. First, these measures depend on the reference panel considered32. Second, we observed a strong inflation of IMPUTE-info for the framework SHAPEIT-IMPUTE2, and numerous best-guess genotypes are wrong even if IMPUTE-info is high. This could explain, for example, the results of Howie et al.17, which were based on IMPUTE-info scores. That study concluded that SHAPEIT-IMPUTE2 and IMPUTE2 perform similarly. However, our simulation study shows that IMPUTE2 without pre-phasing is considerably better. Moreover, we recommend applying higher IMPUTE-info thresholds for SHAPEIT-IMPUTE2 than for IMPUTE2 to achieve similar imputation quality. We generally observed that software frameworks with a pre-phasing strategy performed worse than their equivalents without pre-phasing. Thus, there is a trade-off between imputation accuracy and computational time. However, our analysis of LIFE-Adult shows that the disadvantage of pre-phasing decreases for larger sample sizes.

Regarding the performance of admixed reference panels, it was necessary to study a variety of genetic ethnicities. Therefore, we created 20 different ethnic data subsets of chromosome 22 from the POPRES project26. Each ethnic data subset consisted of equal numbers of individuals (N = 40). Limitations of this approach are the relatively low number of cases as well as the fact that no true admixed target population was considered. Therefore, results might be valid only for small or medium-sized data sets.

As imputation references, we considered the HapMap3 samples CEU, YRI, MEX and JPT + CHB as possible best-matched references. For our POPRES samples, we selected the reference with minimal Nei’s GST as recommended32. For software relying on admixed references, a corresponding HapMap reference was selected. Usage of HapMap references is a limitation of our study. However, in view of the small case numbers of POPRES populations, imputation of rare and low frequency variants is futile (see also supplementary figure S5), and therefore, we have to focus on common variants which are well represented in the HapMap panels.

Comparison of MaCH-Admix using an admixed reference versus MaCH using a specific reference showed that the specific references are advantageous as long as there is a well-matching reference population. A cut-off of Nei’s GST of 0.01 could serve as a rough decision rule whether an admixed reference should be preferred. The software relying on admixed references without pre-phasing, MaCH-Admix and IMPUTE2, performed similarly. However, one has to acknowledge here that this was shown only for small genetically homogeneous populations as those of POPRES.

In summary, admixed references outperformed best-matched references only if the genetic distance was large (Nei’s GST > 0.01). Pre-phasing reduces imputation accuracy, but the difference becomes smaller for larger data sets. Relative measures of imputation accuracy such as MaCH-Rsq and IMPUTE-info should be considered with caution when interpreting and comparing imputation accuracy, since they depend on the reference and the imputation framework. Our conclusions are valid for genetically homogeneous populations of small to moderate sample size.

Additional Information

How to cite this article: Roshyara, N. R. et al. Comparing performance of modern genotype imputation methods in different ethnicities. Sci. Rep. 6, 34386; doi: 10.1038/srep34386 (2016).