Population-specific genotype imputations using minimac or IMPUTE2

Abstract

To analyze common and rare genetic variants meaningfully, results from genome-wide association studies (GWASs) of multiple cohorts need to be combined in a meta-analysis to obtain sufficient statistical power. This requires all cohorts to have the same single-nucleotide polymorphisms (SNPs) in their GWASs. To this end, genotypes that have not been measured in a given cohort can be imputed on the basis of a set of reference haplotypes. This protocol provides guidelines for performing imputations with two widely used tools: minimac and IMPUTE2. These guidelines were developed and used by the Genome of the Netherlands (GoNL) consortium, which has created a population-specific reference panel for genetic imputations and used this reference to impute various Dutch biobanks. We also describe several factors that might influence the final imputation quality. This protocol, which has been used by the largest Dutch biobanks, takes several days to complete, depending on the sample size of the biobank and the computer resources available.

Figure 1: Workflow of the protocol for imputing unobserved genotypes with the GoNL reference panel.
Figure 2: The percentage of SNPs with R² > 0.3 after imputing chromosome 21 of 5,974 samples of Rotterdam Study cohort I, first when the chromosome of the target set is split into chunks of varying size with a 10% overlap between chunks, and second when the chromosome is split into 5 Mb chunks and the size of the overlap is varied.
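
The chunking scheme referred to in Figure 2 can be illustrated with a minimal sketch. It assumes that the overlap is expressed as a percentage of the chunk size and is added as a flank on both sides of each chunk; the protocol may define the overlap differently. The function name `chunk_regions` and the chromosome length of 48,000,000 bp are hypothetical values chosen for illustration only.

```python
def chunk_regions(chrom_length, chunk_size=5_000_000, overlap_pct=10):
    """Split a chromosome into consecutive chunks with overlapping flanks.

    Returns a list of (core_start, core_end, padded_start, padded_end)
    tuples using 1-based inclusive coordinates.
    Assumption: the overlap is a flank of overlap_pct percent of
    chunk_size on each side, clipped to the chromosome boundaries.
    """
    flank = chunk_size * overlap_pct // 100  # integer arithmetic, no rounding issues
    regions = []
    start = 1
    while start <= chrom_length:
        end = min(start + chunk_size - 1, chrom_length)
        padded_start = max(1, start - flank)
        padded_end = min(chrom_length, end + flank)
        regions.append((start, end, padded_start, padded_end))
        start = end + 1
    return regions


# Illustrative chromosome length (hypothetical round number, not a genome build value).
regions = chunk_regions(48_000_000)
for core_start, core_end, padded_start, padded_end in regions:
    print(f"core {core_start}-{core_end}  imputed window {padded_start}-{padded_end}")
```

In such a scheme, genotypes would typically be kept only for the core interval of each chunk, with the flanks serving to provide haplotype context at the chunk edges.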

Acknowledgements

We acknowledge the Genetic Cluster Computer (http://www.geneticcluster.org), which is financially supported by the Netherlands Scientific Organization (NWO 480-05-003) along with a supplement from the Dutch Brain Foundation and the VU University Amsterdam. We thank SURFsara Computing and Networking Services (http://www.surfsara.nl) for their support in using the Lisa Compute Cluster. This work was supported by the BioAssist Biobanking Task Force of the Netherlands Bioinformatics Centre, which is supported by the Netherlands Genomics Initiative. This work is part of the program of BiG Grid, the Dutch e-Science Grid, which is financially supported by the Nederlandse Organisatie voor Wetenschappelijk Onderzoek (Netherlands Organisation for Scientific Research, NWO). This work was financed as a Rainbow Project of the Biobanking and Biomolecular Research Infrastructure Netherlands (BBMRI-NL, RP-2), a Research Infrastructure financed by the Dutch government (NWO 184.021.007). The work of L.C.K. was partially funded by the European Union FP7 (2007–2013) program under grant agreement numbers 305280 (MIMOmics) and 602736 (PainOmics).

Author information

Contributions

E.M.v.L., A.K., L.C.K. and J.J.H. wrote the first draft of the article. A.K., E.M.v.L., M.V.K. and J.J.H. performed analyses. E.M.v.L., P.D. and M.V.K. designed the protocol. D.I.B. performed study design and genotyping of the Netherlands Twin Registry. A.K., P.D., M.V.K., P.I.W.d.B., C.W., M.A.S., D.I.B., C.M.v.D., L.C.K., P.E.S. and J.J.H. revised the article.

Corresponding author

Correspondence to Cornelia M van Duijn.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Additional information

A full list of members and their affiliations is available in the Supplementary Note.

Integrated supplementary information

Supplementary Figure 1 The walltimes when splitting up the data set.

The walltimes per job for MaCH (a, c, e) and minimac (b, d, f) for various ways of splitting up the data set. The walltime is the elapsed real time, as measured by a clock on the wall (covering CPU time, disk writing, etc.), required to impute the target set. The walltime per job for running MaCH fits the linear regression models t = 8.6 + 1.13n (panel a), t = 86.49 + 270.02n (panel c) and t = 1568.3 + 2.7n (panel e). The walltime per job for running minimac fits the linear regression models t = 33.8 + 0.13n (split before MaCH, blue circles) and t = 50.2 + 0.10n (split after MaCH, green squares) (panel b), t = 688.6 + 3.29n (panel d) and t = 687.7 + 0.02n (panel f). Here t is the walltime in minutes and n is the number of samples (a, b), the size of the chunks in Mb (c, d) or the percentage of overlap (e, f). The percentage overlap is 10% in panels c and d, and the chunk size is 5 Mb in panels e and f.
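
As a rough illustration of how the fitted lines in this caption could be used for planning, the sketch below evaluates them for a chosen n. The coefficients are copied from the caption; the dictionary keys and function names are hypothetical, and applying a fit outside the range of n measured in the original figure is an assumption that may not hold.

```python
# (intercept, slope) pairs copied from the Supplementary Figure 1 caption.
# t is the walltime in minutes; the meaning of n depends on the panel.
MODELS = {
    "mach_a": (8.6, 1.13),                     # n = number of samples
    "mach_c": (86.49, 270.02),                 # n = chunk size in Mb
    "mach_e": (1568.3, 2.7),                   # n = percentage overlap
    "minimac_b_split_before": (33.8, 0.13),    # n = number of samples
    "minimac_b_split_after": (50.2, 0.10),     # n = number of samples
    "minimac_d": (688.6, 3.29),                # n = chunk size in Mb
    "minimac_f": (687.7, 0.02),                # n = percentage overlap
}


def predicted_walltime(model, n):
    """Evaluate t = intercept + slope * n for one fitted line."""
    intercept, slope = MODELS[model]
    return intercept + slope * n


def crossover(model_1, model_2):
    """Return the n at which two fitted lines predict the same walltime."""
    a1, b1 = MODELS[model_1]
    a2, b2 = MODELS[model_2]
    return (a2 - a1) / (b1 - b2)


# Hypothetical job of 1,000 samples, compared under the two panel b fits.
before = predicted_walltime("minimac_b_split_before", 1000)
after = predicted_walltime("minimac_b_split_after", 1000)
equal_n = crossover("minimac_b_split_before", "minimac_b_split_after")
print(f"split before MaCH: {before:.1f} min, split after MaCH: {after:.1f} min")
print(f"the two fitted lines cross at n = {equal_n:.0f} samples")
```

Under these two fitted lines, splitting before MaCH is predicted to be faster per minimac job below the crossover point and slower above it; this is a property of the regression lines only and says nothing about total pipeline time.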

Supplementary information

Supplementary Text and Figures

Supplementary Figure 1 and Supplementary Note (PDF 302 kb)

About this article

Cite this article

van Leeuwen, E., Kanterakis, A., Deelen, P. et al. Population-specific genotype imputations using minimac or IMPUTE2. Nat Protoc 10, 1285–1296 (2015). https://doi.org/10.1038/nprot.2015.077
