Improved imputation of low-frequency and rare variants using the UK10K haplotype reference panel

Imputing genotypes from reference panels created by whole-genome sequencing (WGS) provides a cost-effective strategy for augmenting the single-nucleotide polymorphism (SNP) content of genome-wide arrays. The UK10K Cohorts project has generated a data set of 3,781 whole genomes sequenced at low depth (average 7x), aiming to exhaustively characterize genetic variation down to 0.1% minor allele frequency in the British population. Here we demonstrate the value of this resource for improving imputation accuracy at rare and low-frequency variants in both a UK and an Italian population. We show that large increases in imputation accuracy can be achieved by re-phasing WGS reference panels after initial genotype calling. We also present a method for combining WGS panels to improve variant coverage and downstream imputation accuracy, which we illustrate by integrating 7,562 WGS haplotypes from the UK10K project with 2,184 haplotypes from the 1000 Genomes Project. Finally, we introduce a novel approximation that maintains speed without sacrificing imputation accuracy for rare variants.


Imputation performance of different reference panels
Imputation accuracy in the UK10K pseudo-GWAS test panel using reference panels from 1000GP (black), UK10K (blue), and UK10K+1000GP (red) across all MAFs. The "original" UK10K reference panel (dotted blue line) was produced by standard genotype refinement of low-coverage sequencing data, while the "rephased" reference panel (solid blue line) was produced by running SHAPEIT2 on the genotypes called by BEAGLE to improve haplotype accuracy. The rephased UK10K panel was combined with the 1000GP panel to produce the UK10K+1000GP panel.

SNPs with MAF<5% in the INCIPE pseudo-GWAS panel were imputed with the UK10K reference panel under two different IMPUTE2 settings: one that used a Hamming distance approximation to choose a customized subset of 500 reference haplotypes when imputing each study haplotype (x-axis; mean r²=0.27), and one that used all available reference haplotypes with no approximation (y-axis; mean r²=0.33).

Supplementary Notes
Supplementary Note 1. Imputation strategy and novel software functionality for merging WGS datasets
Genotype imputation is now widely used in GWAS to boost power, carry out fine-mapping and facilitate meta-analysis 1. Usually imputation is carried out using a single haplotype reference panel, such as those produced by the HapMap project or the 1000 Genomes Project. We have developed a new option in the IMPUTE2 software 2,3 that allows two sets of haplotypes to be combined to form a single set of haplotypes at the union set of sites.
Imputation into GWAS samples can then be carried out using this combined panel. This method can be used to combine two sets of haplotypes from two distinct population cohorts, such as UK10K and 1000 Genomes, as we have done in this paper. Alternatively, a particular study may have sequenced specific individuals with high relevance to the GWAS and wish to combine that set of haplotypes with one of the publicly available haplotype sets.
The main difficulty in combining reference panels is that some sites will only have data in one or other of the panels. This may be because the site is monomorphic for the reference allele in the cohort, in which case the site is unlikely to have been 'called' from the sequencing. However, the site may also be polymorphic and may not have been called due to low coverage of the non-reference allele, or due to cohort-specific site filtering that removed the site from consideration. We impute the untyped variants in three steps:
1. Impute the variants that are specific to Panel 0 (red) into Panel 1 (blue). Variants shown in grey do not inform the imputation.
2. Impute the variants that are specific to Panel 1 (blue) into Panel 0 (red). Variants shown in grey do not inform the imputation.
3. Now that we have imputed the two reference panels up to the union of their variants, treat the imputed haplotypes as known (i.e., take the best-guess haplotypes) and impute the GWAS cohort in the usual way.
Our implementation allows for the use of unphased or pre-phased GWAS samples. In addition, IMPUTE2 outputs a file containing a merged haplotype reference panel that can be used for future imputation without repeating this step. This new functionality is available in IMPUTE2 v2.3.1 at https://mathgen.stats.ox.ac.uk/impute/impute_v2.html.
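The three-step merging strategy can be illustrated with a small Python sketch on toy data. This is not the IMPUTE2 implementation: the HMM-based imputation is replaced here by a simple nearest-neighbour copy (each haplotype takes its missing alleles from the closest haplotype in the other panel, measured by mismatches at shared sites). The data layout (a dict from site position to allele) and all function names are assumptions made for this example.

```python
def mismatches(hap_a, hap_b, sites):
    """Number of sites at which two haplotypes carry different alleles."""
    return sum(hap_a[s] != hap_b[s] for s in sites)


def fill_from_donor(target_panel, donor_panel, shared, donor_only):
    """Fill donor-only sites in each target haplotype.

    Stand-in for HMM imputation: copy the alleles of the donor haplotype
    that is closest at the shared sites (a deliberate simplification).
    """
    filled_panel = []
    for hap in target_panel:
        best = min(donor_panel, key=lambda d: mismatches(hap, d, shared))
        filled = dict(hap)
        for s in donor_only:
            filled[s] = best[s]
        filled_panel.append(filled)
    return filled_panel


def merge_panels(panel0, panel1):
    """Return one set of haplotypes defined at the union of sites."""
    sites0, sites1 = set(panel0[0]), set(panel1[0])
    shared = sorted(sites0 & sites1)
    only0 = sorted(sites0 - sites1)
    only1 = sorted(sites1 - sites0)
    panel1_full = fill_from_donor(panel1, panel0, shared, only0)  # step 1
    panel0_full = fill_from_donor(panel0, panel1, shared, only1)  # step 2
    # Step 3 would treat these best-guess haplotypes as known and use
    # them as a single reference panel for imputing the GWAS cohort.
    return panel0_full + panel1_full


# Toy panels: site 200 is specific to panel 0, site 400 to panel 1.
panel0 = [{100: 0, 200: 1, 300: 0}, {100: 1, 200: 0, 300: 1}]
panel1 = [{100: 0, 300: 0, 400: 1}, {100: 1, 300: 1, 400: 0}]
merged = merge_panels(panel0, panel1)
```

In this toy case every merged haplotype is defined at all four sites, so the combined panel can be used like any single reference panel.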

Background and motivation
Genotype imputation in GWAS has always been a computationally intensive task. Recent developments like pre-phasing have greatly reduced the computational cost of imputation, but growing reference panels continue to challenge existing methods. As we were conducting the analyses for this manuscript, we evaluated an approximation developed by

Understanding limitations of the Hamming approximation
To better understand why the Hamming distance approximation failed to successfully impute the rare SNP highlighted above, we examined the state-copying probabilities generated by IMPUTE2 when no approximation was used with the UK10K reference panel.
These probabilities are calculated at each site that is shared between a GWAS data set and a reference panel. At a given site, the reference haplotypes ("copying states") with the largest probabilities contribute the most to the imputation. Of the 500 states selected by the Hamming distance approximation, 25 were among the 103 states in this plot, but these did not include the haplotype carrying the highlighted rare variant, which is why it was not successfully imputed by the Hamming method.
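To make the notion of state-copying probabilities concrete, the following Python sketch computes them with a minimal haploid copying model in the style of Li and Stephens, using the forward-backward algorithm. It is a sketch under stated assumptions rather than IMPUTE2's actual model: the switch probability `rho` and the error rate `eps` are fixed, hypothetical constants (IMPUTE2 derives its transition rates from a genetic map), and the function name is invented for this example.

```python
def copying_posteriors(study, ref, rho=0.05, eps=0.01):
    """Posterior probability of copying each reference haplotype at each site.

    study: list of alleles for one study haplotype
    ref:   list of reference haplotypes (same length as study)
    rho:   assumed per-interval switch probability (hypothetical constant)
    eps:   assumed mismatch (error) probability (hypothetical constant)
    """
    n, m = len(ref), len(study)

    def emit(k, j):
        return 1.0 - eps if ref[k][j] == study[j] else eps

    def normalise(v):
        total = sum(v)
        return [x / total for x in v]

    # Forward pass, rescaled at every site to avoid underflow.
    fwd = [normalise([emit(k, 0) / n for k in range(n)])]
    for j in range(1, m):
        prev = fwd[-1]
        fwd.append(normalise(
            [((1 - rho) * prev[k] + rho / n) * emit(k, j) for k in range(n)]
        ))

    # Backward pass with the same transition structure.
    bwd = [[1.0] * n for _ in range(m)]
    for j in range(m - 2, -1, -1):
        nxt = [bwd[j + 1][k] * emit(k, j + 1) for k in range(n)]
        total = sum(nxt)
        bwd[j] = normalise(
            [(1 - rho) * nxt[k] + rho / n * total for k in range(n)]
        )

    # Posterior at each site is proportional to forward times backward.
    return [normalise([f * b for f, b in zip(fwd[j], bwd[j])])
            for j in range(m)]


ref = [[0, 0, 0, 0], [1, 1, 1, 1], [0, 1, 0, 1]]
study = [1, 1, 1, 1]
posteriors = copying_posteriors(study, ref)
best_state = [max(range(len(ref)), key=row.__getitem__) for row in posteriors]
```

Here the study haplotype is identical to the second reference haplotype, so that state has the largest copying probability at every site.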

A notable feature of this plot is that the copied states change frequently along the region, which is a consequence of the high recombination rate in this region (average of 3.5 cM/Mb). It can also be seen that the shared haplotype tract of interest, shown as a row of red dots, is distinctive and short: within the range of the red dots, this haplotype is often the only one with a meaningful copying probability, yet the shared tract is only ~300 kb long (many alleles at 0.1% frequency reside on longer haplotype backgrounds). There is a clear signal of haplotype sharing to be found here, but it is not easy to detect via region-wide metrics like Hamming distance.
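A toy example shows how a region-wide metric can miss a short but distinctive shared tract. The sketch below ranks reference haplotypes by Hamming distance to the study haplotype over the whole region and keeps the closest `k`. The function name and the data are assumptions made for illustration, and this is not IMPUTE2's code.

```python
def select_by_hamming(study, ref, k):
    """Indices of the k reference haplotypes closest to the study haplotype,
    using Hamming distance over the whole region (ties broken by index)."""
    dist = [sum(a != b for a, b in zip(study, hap)) for hap in ref]
    order = sorted(range(len(ref)), key=lambda i: (dist[i], i))
    return order[:k]


study = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
ref = [
    [0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0],  # 2 mismatches, both inside sites 4-7
    [0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0],  # 3 scattered mismatches
    [1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1],  # perfect match on sites 3-8 only
]
selected = select_by_hamming(study, ref, k=2)
```

The third haplotype is the only one that matches the study haplotype across sites 4 to 7, but its six mismatches elsewhere in the region push it out of the top two.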
Observations like this led us to develop a new approximation that focuses on capturing the shared reference haplotype tracts around each site in a study haplotype, rather than averaging these out with region-wide metrics. Our goal was to capture the same kind of information used by methods like MVNcall 5 for one site at a time, but to do so in a way that produces an ensemble of k_hap haplotypes that can be used to impute an entire region, analogous to the current Hamming distance approach used by IMPUTE2.
A novel tract sharing approximation
The goal behind our new approximation is to ensure that each site in a study haplotype has the opportunity to copy the reference haplotype with the longest shared tract of allelic identity. If this goal can be fulfilled with fewer than k_hap reference haplotypes, we continue adding haplotypes with shorter shared tracts until k_hap unique states have been selected.
This approach aims to capture local copying information while allowing a user to control the computational costs via k_hap, as is currently done with the Hamming distance method.
Our algorithm works as follows, from the point of view of a single GWAS haplotype:
5. Go to the next-ranked haplotype index ("level") and repeat Step 4 until k_hap distinct reference haplotypes have been identified. If the number of selected haplotypes exceeds k_hap at a particular level, choose a random subset of the reference indices at that level such that the total number of selected haplotypes is k_hap.
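The level-by-level selection can be sketched in Python as follows. Because the earlier steps of the algorithm are not reproduced in this text, the sketch fills them in with assumptions that are ours, not the authors': a shared tract is taken to be the run of consecutive matching alleles covering a site, reference haplotypes are ranked per site by that run length, and ties are broken by index. All function names are hypothetical.

```python
import random


def tract_lengths(study, hap):
    """For each site, the length of the run of matching alleles covering it
    (0 where the two haplotypes mismatch)."""
    m = len(study)
    out = [0] * m
    j = 0
    while j < m:
        if study[j] != hap[j]:
            j += 1
            continue
        end = j
        while end < m and study[end] == hap[end]:
            end += 1
        for s in range(j, end):
            out[s] = end - j
        j = end
    return out


def select_by_tracts(study, ref, k_hap, seed=0):
    """Pick k_hap reference haplotypes, level by level (assumed reading)."""
    rng = random.Random(seed)
    m, n = len(study), len(ref)
    lengths = [tract_lengths(study, hap) for hap in ref]
    # ranked[j]: reference indices at site j, longest shared tract first.
    ranked = [sorted(range(n), key=lambda i: (-lengths[i][j], i))
              for j in range(m)]
    chosen = []
    for level in range(n):
        new = sorted({ranked[j][level] for j in range(m)} - set(chosen))
        room = k_hap - len(chosen)
        if len(new) > room:
            # Too many candidates at this level: keep a random subset.
            new = rng.sample(new, room)
        chosen.extend(new)
        if len(chosen) >= k_hap:
            break
    return sorted(chosen)


study = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
ref = [
    [0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0],  # long tracts on both flanks
    [0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0],  # only short tracts
    [1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1],  # longest tract over sites 3-8
]
selected = select_by_tracts(study, ref, k_hap=2)
```

With k_hap=2, the first level already yields two distinct haplotypes: the one with the longest flanking tracts and the one that uniquely matches the central tract. A region-wide Hamming ranking of the same data would instead keep the first two haplotypes.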
Supplementary Figure 5B shows that this algorithm is much more effective than the region-

Supplementary Figure 6A shows that the new approximation with k_hap=500 provides essentially the same accuracy as using the entire UK10K reference panel: the mean r² values in this analysis were 0.32 and 0.33, respectively, and none of the SNPs imputed well (r²>0.8) by the full reference panel were missed when using the approximation; this includes the rare SNP that was previously imputed poorly by the Hamming distance method (red dot). To see if we could push this approach even further, we also ran the tract sharing approximation with k_hap=60 (Supplementary Figure 6B). Accuracy was slightly lower at this setting (mean r²=0.30), but the results were still better than the analysis with k_hap=500 under the Hamming metric (mean r²=0.27), and again there were few major discrepancies between the results with this approximation versus the full reference panel.
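For readers who want to reproduce this kind of comparison, the sketch below computes a per-SNP r² as the squared Pearson correlation between true genotypes and imputed dosages, and then averages it across SNPs. This definition is a common convention that we assume here; the text does not spell out the exact formula used. Returning 0 for a SNP with no variance is likewise an assumption, and the function names and toy values are hypothetical.

```python
def r_squared(truth, dosage):
    """Squared Pearson correlation between true genotypes (0/1/2) and
    imputed dosages for one SNP. Returns 0.0 if either has no variance
    (an assumed convention)."""
    n = len(truth)
    mean_t = sum(truth) / n
    mean_d = sum(dosage) / n
    cov = sum((t - mean_t) * (d - mean_d) for t, d in zip(truth, dosage))
    var_t = sum((t - mean_t) ** 2 for t in truth)
    var_d = sum((d - mean_d) ** 2 for d in dosage)
    if var_t == 0 or var_d == 0:
        return 0.0
    return cov * cov / (var_t * var_d)


def mean_r_squared(snps):
    """Average r² over a list of (truth, dosage) pairs, one per SNP."""
    return sum(r_squared(t, d) for t, d in snps) / len(snps)


# Toy data: one well-imputed SNP and one SNP imputed as monomorphic.
snps = [
    ([0, 1, 2, 0, 1], [0.1, 0.9, 1.8, 0.2, 1.1]),
    ([0, 0, 1, 0, 0], [0.0, 0.0, 0.0, 0.0, 0.0]),
]
average = mean_r_squared(snps)
```

Averaging over SNPs in this way means that a rare variant missed entirely by an approximation contributes 0 to the mean, which is how a small number of poorly imputed rare SNPs can lower the summary value.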

Conclusions
In summary, our new tract sharing approximation has a similar computational cost to the Hamming distance approximation of ref. 3, but it is better at maintaining imputation accuracy for low-frequency and rare SNPs. We believe that this will be a useful approach as imputation reference panels continue to grow.

Supplementary Note 3. Re-phasing and imputation commands.
These are the command options for imputation using the combined UK10K+1000GP panel in IMPUTE2, using one region of chromosome 20 as an example.