Technical Report

Improved genome inference in the MHC using a population reference graph

Nature Genetics volume 47, pages 682–688 (2015)


Although much is known about human genetic variation, such information is typically ignored in assembling new genomes. Instead, reads are mapped to a single reference, which can lead to poor characterization of regions of high sequence or structural diversity. We introduce a population reference graph, which combines multiple reference sequences and catalogs of variation. The genomes of new samples are reconstructed as paths through the graph using an efficient hidden Markov model, allowing for recombination between different haplotypes and additional variants. By applying the method to the 4.5-Mb extended MHC region on human chromosome 6, combining 8 assembled haplotypes, the sequences of known classical HLA alleles and 87,640 SNP variants from the 1000 Genomes Project, we demonstrate, using simulations, SNP genotyping, and short-read and long-read data, how the method improves the accuracy of genome inference and identifies regions where the current set of reference sequences is substantially incomplete.
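The idea of reconstructing a genome as a most likely path that may switch between reference haplotypes can be illustrated with a minimal sketch. The Python code below is not the authors' implementation: it is a haploid toy in which the graph is collapsed to aligned columns, each hidden state is one reference haplotype, and a Viterbi pass picks the best haplotype per column. The function names (`viterbi_path`, `infer_sequence`), the parameters `switch_prob` and `error_prob`, and the simple match/mismatch emission model are all assumptions made for illustration.

```python
import math


def viterbi_path(haplotypes, observed, switch_prob=0.01, error_prob=0.01):
    """Return the most likely haplotype index for each aligned column.

    Assumptions (illustrative only): equal-length haplotypes, one observed
    allele per column, a uniform switch (recombination) probability and a
    uniform mismatch (error) probability.
    """
    n_hap = len(haplotypes)
    n_col = len(observed)
    stay = math.log(1.0 - switch_prob)
    # Switching cost is shared among the other haplotypes.
    switch = math.log(switch_prob / (n_hap - 1)) if n_hap > 1 else float("-inf")

    def emit(h, j):
        match = haplotypes[h][j] == observed[j]
        return math.log(1.0 - error_prob) if match else math.log(error_prob)

    # Initialise with a uniform prior over haplotypes.
    score = [math.log(1.0 / n_hap) + emit(h, 0) for h in range(n_hap)]
    back = []
    for j in range(1, n_col):
        new_score, pointers = [], []
        for h in range(n_hap):
            best_prev = max(
                range(n_hap),
                key=lambda p: score[p] + (stay if p == h else switch),
            )
            step = stay if best_prev == h else switch
            new_score.append(score[best_prev] + step + emit(h, j))
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_hap), key=lambda h: score[h])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def infer_sequence(haplotypes, observed, **kwargs):
    """Spell out the sequence along the Viterbi path."""
    path = viterbi_path(haplotypes, observed, **kwargs)
    return "".join(haplotypes[h][j] for j, h in enumerate(path)), path


if __name__ == "__main__":
    panel = ["AAAAAAAA", "CCCCCCCC"]  # hypothetical reference haplotypes
    print(infer_sequence(panel, "AAAACCCC"))
```

With these toy settings, an observation that matches one haplotype on the left and another on the right is explained by a single switch, whereas an isolated mismatch is absorbed as an error because two switches cost more than one mismatch.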





We thank M. Eberle and colleagues at Illumina for early access to the Moleculo data. The study was funded by grants from GlaxoSmithKline and grant 100956/Z/13/Z from the Wellcome Trust to G.M., a Nuffield Department of Medicine Fellowship to Z.I. and a Sir Henry Dale Fellowship jointly awarded by the Wellcome Trust and the Royal Society to Z.I. (102541/Z/13/Z).

Author information


  1. Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK.

    • Alexander Dilthey
    • Zamin Iqbal
    • Gil McVean
  2. Department of Quantitative Sciences, GlaxoSmithKline, Stevenage, UK.

    • Charles Cox
  3. Department of Quantitative Sciences, GlaxoSmithKline, Research Triangle Park, North Carolina, USA.

    • Matthew R Nelson


G.M. designed the experiment. A.D. and C.C. performed analyses. Z.I., M.R.N. and G.M. supervised the research. A.D. and G.M. wrote the manuscript with the assistance of co-authors.

Competing interests

C.C. and M.R.N. are employed by GlaxoSmithKline (GSK) and may own GSK stock. GSK does not sell or market any software or services related to genetic analysis or the generation of genetic data. G.M. is a founder and shareholder of Genomics, Ltd. G.M. and A.D. are partners in Peptide Groove, LLP.

Corresponding authors

Correspondence to Alexander Dilthey or Gil McVean.

Supplementary information

PDF files

  1.

    Supplementary Text and Figures

    Supplementary Figures 1–3, Supplementary Tables 1–9 and Supplementary Note

Zip files

  1.

    Supplementary Data Set: Discrepancies between SNP array and PRG genotypes in NA12878.

    Compressed (zip) file with screenshots showing read mapping at the 55 sites where the Viterbi-inferred genotype from the PRG disagrees with the SNP array genotype and where the PRG specifies a gap character. A manual evaluation of these sites is also provided as an Excel file.
