A reference panel of 64,976 haplotypes for genotype imputation

Abstract

We describe a reference panel of 64,976 human haplotypes at 39,235,157 SNPs constructed using whole-genome sequence data from 20 studies of predominantly European ancestry. Using this resource leads to accurate genotype imputation at minor allele frequencies as low as 0.1% and a large increase in the number of SNPs tested in association studies, and it can help to discover and refine causal loci. We describe remote server resources that allow researchers to carry out imputation and phasing consistently and efficiently.

Figure 1: Performance of imputation using different reference panels.
Figure 2: Association signal for the α1-antitrypsin phenotype at the SERPINA1 locus.

Acknowledgements

We are grateful to all participants of all the studies that have contributed data to the HRC. J.M. acknowledges support from the ERC (grant 617306). W.K. acknowledges support from the Wellcome Trust (grant WT097307). S. McCarthy and R.D. acknowledge support from Wellcome Trust grant WT090851. A full list of acknowledgments for the cohorts is given in the Supplementary Note.

Author information

Contributions

The HRC was initially conceived through discussions between J.M., G.A., R.D., M.I.M. and M.B. Analysis and methods development were carried out by S. McCarthy, S.D., W.K., O.D., A.R.W., P.D. and H.M.K. Supervision of the research was provided by J.M., G.A. and R.D. The Michigan Imputation Server was developed by C.F., L. Forer, S.S. and G.A. The Sanger Imputation Service was developed by P.D., S. McCarthy and R.D. The Oxford Statistics Phasing Server was developed by W.K., K. Sharp and J.M. All other authors contributed data sets to the project or provided advice.

Corresponding authors

Correspondence to Richard Durbin or Gonçalo Abecasis or Jonathan Marchini.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 The effect of site filtering on Ts/Tv ratio per sample

The top figure shows the per-sample transition-transversion ratio (Ts/Tv) for chromosome 20 after running the GLPhase genotype calling method on the full MAC5 site list. In the bottom figure, GLPhase was run after the site filtering described in the text.
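As a reminder of what the Ts/Tv statistic measures, the following minimal sketch counts transitions (A<->G, C<->T) and transversions over a list of biallelic SNPs. The function name and the input format (pairs of reference and alternate alleles) are assumptions made for illustration, not part of the HRC pipeline; a per-sample value would presumably be obtained by passing only the sites at which that sample carries a non-reference allele.

```python
# Illustrative sketch only: names and input format are assumptions,
# not the HRC consortium's code.
BASES = {"A", "C", "G", "T"}
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ts_tv_ratio(snps):
    """Return transitions / transversions for an iterable of (ref, alt) alleles.

    Entries that are not single-base substitutions are skipped.
    Returns NaN when there are no transversions.
    """
    ts = 0
    tv = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue  # not a biallelic single-base substitution
        if (ref, alt) in TRANSITIONS:
            ts += 1
        else:
            tv += 1
    if tv == 0:
        return float("nan")
    return ts / tv


if __name__ == "__main__":
    example = [("A", "G"), ("C", "T"), ("A", "C")]
    print(ts_tv_ratio(example))
```

A site list contaminated with false-positive calls tends to show a lower ratio, which is why the statistic is used here to compare the unfiltered and filtered site lists.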

Supplementary Figure 2 Data summaries before and after site filtering

Panel a shows the number of sites in the unfiltered and filtered MAC5 site lists (chromosome 20), stratified by non-reference allele frequency. The allele frequency is calculated from the genotypes obtained by running the GLPhase genotype calling method on the full MAC5 site list. Panel b shows the corresponding transition-transversion ratio (Ts/Tv) of these sites.

Supplementary Figure 3 Performance of imputation using different reference panels

The x-axis shows the non-reference allele frequency of the SNP being imputed on a log scale. The y-axis shows imputation accuracy measured by aggregate r² when imputing SNP genotypes into 10 CEU samples. These results are based on using genotypes from sites on the Illumina Core Exome SNP array.
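The accuracy measure in this caption can be sketched as follows, under the assumption that "aggregate r²" means the squared Pearson correlation between true genotypes and imputed dosages, pooled across all SNPs that fall in the same allele frequency bin. The function names, the record layout and the bin edges are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: assumes aggregate r^2 is the squared Pearson
# correlation of pooled (true genotype, imputed dosage) pairs per bin.
from bisect import bisect_left


def aggregate_r2(true_genotypes, imputed_dosages):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(true_genotypes)
    if n != len(imputed_dosages) or n < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    mean_x = sum(true_genotypes) / n
    mean_y = sum(imputed_dosages) / n
    sxx = sum((x - mean_x) ** 2 for x in true_genotypes)
    syy = sum((y - mean_y) ** 2 for y in imputed_dosages)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(true_genotypes, imputed_dosages))
    if sxx == 0 or syy == 0:
        return float("nan")  # correlation undefined without variation
    return (sxy * sxy) / (sxx * syy)


def aggregate_r2_by_bin(records, bin_edges):
    """Pool SNPs by allele frequency bin, then compute r^2 per bin.

    records: iterable of (allele_freq, true_genotypes, imputed_dosages).
    bin_edges: sorted edges; bin i covers (bin_edges[i], bin_edges[i + 1]].
    Returns {bin_index: r2} for bins that received at least one SNP.
    """
    pooled = {}
    for freq, truth, dosage in records:
        idx = bisect_left(bin_edges, freq) - 1
        if idx < 0 or idx >= len(bin_edges) - 1:
            continue  # frequency falls outside all bins
        xs, ys = pooled.setdefault(idx, ([], []))
        xs.extend(truth)
        ys.extend(dosage)
    return {idx: aggregate_r2(xs, ys) for idx, (xs, ys) in pooled.items()}


if __name__ == "__main__":
    records = [
        (0.005, [0, 1, 0, 0], [0.1, 0.9, 0.0, 0.2]),
        (0.2, [0, 1, 2, 1], [0.0, 1.0, 2.0, 1.0]),
    ]
    print(aggregate_r2_by_bin(records, [0.001, 0.01, 0.5]))
```

Pooling before computing the correlation keeps the estimate usable at rare variants, where a per-SNP correlation over only 10 samples would often be undefined.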

Supplementary Figure 4 Performance of imputation using different reference panels

The x-axis shows the non-reference allele frequency of the SNP being imputed on a log scale. The y-axis shows imputation accuracy measured by aggregate r² when imputing SNP genotypes into 10 CEU samples. These results are based on using genotypes from sites on the Illumina OMNI 5M SNP array.

Supplementary Figure 5 Site stratification by calling and filtering status across cohorts

The x-axis shows the number of studies in which a variant was called (out of 20), and the y-axis shows the number of times it was filtered out by the cohort-specific internal QC pipelines. The color shows the percentage of variants in each cell (red means more than 10% of variants lie in that cell; blue means less than 0.1%). The number at the top right of each cell is the Ts/Tv ratio for all sites in that cell. Cells higher in the plot were filtered out relatively often and usually represent poor-quality variants, as the low Ts/Tv ratio also indicates. All variants above the red line were filtered out; this excludes all cells that were filtered independently by more than 4 studies or that have a Ts/Tv ratio below 1.7.
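Read literally, the exclusion rule in this caption is a two-condition test on each cell of the grid. The sketch below encodes that literal reading; the function and parameter names are hypothetical, and the red line in the actual figure may not coincide exactly with this simplified rule.

```python
# Illustrative sketch only: a literal reading of the caption's rule,
# not the consortium's filtering code.
def cell_is_excluded(times_filtered, cell_ts_tv, max_filtered=4, min_ts_tv=1.7):
    """Return True if a cell of the calling/filtering grid would be excluded.

    times_filtered: number of studies whose internal QC filtered the variant.
    cell_ts_tv: Ts/Tv ratio computed over all sites in the cell.
    """
    return times_filtered > max_filtered or cell_ts_tv < min_ts_tv


if __name__ == "__main__":
    # (times filtered, Ts/Tv of the cell) for three hypothetical cells
    for cell in [(0, 2.1), (5, 2.0), (2, 1.5)]:
        print(cell, cell_is_excluded(*cell))
```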

Supplementary Figure 6 Comparison of methods for genotype calling as sample size increases

The figure shows a log-log plot of run time versus sample size for four different methods of genotype calling from GL data. For each sample size, five random chunks of 1,024 sites from chromosome 20 were used. Each dot represents the run time of a single data set. Lines are drawn between successive means of the run times at each sample size.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–6, Supplementary Tables 1–8 and Supplementary Note. (PDF 1898 kb)

Cite this article

The Haplotype Reference Consortium, McCarthy, S., Das, S. et al. A reference panel of 64,976 haplotypes for genotype imputation. Nat. Genet. 48, 1279–1283 (2016). https://doi.org/10.1038/ng.3643
