Improved genome inference in the MHC using a population reference graph


Although much is known about human genetic variation, such information is typically ignored in assembling new genomes. Instead, reads are mapped to a single reference, which can lead to poor characterization of regions of high sequence or structural diversity. We introduce a population reference graph, which combines multiple reference sequences and catalogs of variation. The genomes of new samples are reconstructed as paths through the graph using an efficient hidden Markov model, allowing for recombination between different haplotypes and additional variants. By applying the method to the 4.5-Mb extended MHC region on human chromosome 6, combining 8 assembled haplotypes, the sequences of known classical HLA alleles and 87,640 SNP variants from the 1000 Genomes Project, we demonstrate, using simulations, SNP genotyping, and short-read and long-read data, how the method improves the accuracy of genome inference, and we identify regions where the current set of reference sequences is substantially incomplete.
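The idea of reconstructing a sample as a path that may recombine between reference haplotypes can be illustrated with a toy Viterbi decoder. This is only a minimal haploid sketch under simplifying assumptions, not the model used in the paper: the hidden states are rows of a gap-free alignment, and the function name `viterbi_mosaic` and the parameters `switch_prob` and `error_prob` are hypothetical.

```python
import math


def viterbi_mosaic(haplotypes, observed, switch_prob=0.01, error_prob=0.01):
    """Return, for each column, the index of the haplotype on the best path.

    haplotypes: equal-length aligned strings (one hidden state per row).
    observed:   string of the same length, one symbol per column.
    switch_prob and error_prob are illustrative values, not fitted parameters.
    """
    n = len(haplotypes)
    log_stay = math.log(1.0 - switch_prob)
    # Switching mass is shared equally among the other haplotypes.
    log_switch = math.log(switch_prob / (n - 1)) if n > 1 else float("-inf")

    def log_emit(h, pos):
        match = haplotypes[h][pos] == observed[pos]
        return math.log(1.0 - error_prob) if match else math.log(error_prob)

    # Uniform prior over the starting haplotype.
    score = [math.log(1.0 / n) + log_emit(h, 0) for h in range(n)]
    back = []
    for pos in range(1, len(observed)):
        new_score, pointers = [], []
        for h in range(n):
            # Best predecessor: stay on h, or recombine from another row.
            best, arg = max(
                (score[g] + (log_stay if g == h else log_switch), g)
                for g in range(n)
            )
            new_score.append(best + log_emit(h, pos))
            pointers.append(arg)
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n), key=lambda h: score[h])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    refs = ["AAAAAAAA", "CCCCCCCC"]
    print(viterbi_mosaic(refs, "AAAACCCC"))
```

In the usage example the observed sequence matches the first reference over its first half and the second reference over its second half, so a single recombination is cheaper than four mismatches. A diploid version would decode a pair of paths jointly, which this sketch does not attempt.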

Figure 1: Read mapping in the MHC class II region.
Figure 2: Schematic showing the construction and application of a PRG.
Figure 3: Simulation study and empirical validation.
Figure 4: Recovery of chromotype k-mers from HTS data.
Figure 5: Spatial recovery of k-mers within the HLA class II region.
Figure 6: Alignment of synthetic long-read data to chromotypes.

Accession codes


NCBI Reference Sequence



We thank M. Eberle and colleagues at Illumina for early access to the Moleculo data. The study was funded by grants from GlaxoSmithKline and grant 100956/Z/13/Z from the Wellcome Trust to G.M., a Nuffield Department of Medicine Fellowship to Z.I. and a Sir Henry Dale Fellowship jointly awarded by the Wellcome Trust and the Royal Society to Z.I. (102541/Z/13/Z).

Author information




G.M. designed the experiment. A.D. and C.C. performed analyses. Z.I., M.R.N. and G.M. supervised the research. A.D. and G.M. wrote the manuscript with the assistance of co-authors.

Corresponding authors

Correspondence to Alexander Dilthey or Gil McVean.

Ethics declarations

Competing interests

C.C. and M.R.N. are employed by GlaxoSmithKline (GSK) and may own GSK stock. GSK does not sell or market any software or services related to genetic analysis or the generation of genetic data. G.M. is a founder and shareholder of Genomics, Ltd. G.M. and A.D. are partners in Peptide Groove, LLP.

Integrated supplementary information

Supplementary Figure 1 Relationship between the nucleotide and k-mer PRGs.

The nucleotide PRG is a directed, acyclic graph constructed from a multiple-sequence alignment reflecting variation within the aligned sequences. A k-mer PRG is constructed from the nucleotide PRG by enumerating the possible paths of length k and their relationship. A multi-PRG is generated by combining all non-branching stretches of levels in the k-mer PRG into single levels for the multi-PRG, with edges labeled with multiple k-mers.
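The relationship between an alignment, a level-structured graph and its k-mers can be sketched as follows. This is a deliberately simplified illustration, assuming a gap-free alignment and one node per distinct symbol per column; the paper's actual node-merging rules and gap handling are not reproduced, and the function names `build_nucleotide_prg` and `enumerate_kmers` are hypothetical.

```python
def build_nucleotide_prg(alignment):
    """Build a column-wise directed acyclic graph from aligned sequences.

    levels[i] lists the distinct symbols in column i.
    edges[i] holds (a, b) pairs where a (column i) is followed by b
    (column i + 1) in at least one input sequence.
    """
    length = len(alignment[0])
    assert all(len(s) == length for s in alignment), "sequences must be aligned"
    levels = [sorted({s[i] for s in alignment}) for i in range(length)]
    edges = [{(s[i], s[i + 1]) for s in alignment} for i in range(length - 1)]
    return levels, edges


def enumerate_kmers(levels, edges, k):
    """Enumerate every walk of k symbols, keyed by its starting level."""
    kmers = {}
    for start in range(len(levels) - k + 1):
        walks = list(levels[start])
        for i in range(start, start + k - 1):
            # Extend each walk along every edge leaving its last symbol.
            walks = [w + b for w in walks for (a, b) in sorted(edges[i]) if a == w[-1]]
        kmers[start] = sorted(walks)
    return kmers


if __name__ == "__main__":
    levels, edges = build_nucleotide_prg(["ACGT", "ATGT", "ACGA"])
    for start, walks in enumerate_kmers(levels, edges, 3).items():
        print(start, walks)
```

In this toy graph the walk `TGA` starting at the second level does not occur in any input sequence. It arises because paths through shared nodes can combine variants from different inputs, which is how a graph of this kind can represent recombinant haplotypes.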

Supplementary Figure 2 NA12878 k-mer recovery within classical HLA loci for four approaches.

Each panel shows the fraction of k-mers recovered at single-nucleotide resolution from chromotypes inferred by the four methods using the high-coverage data from NA12878. The average over the locus is also shown.
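A k-mer recovery fraction of this kind can be computed along the following lines. This is a minimal sketch under stated assumptions: it treats a k-mer as recovered if it appears at least once in any read, and it ignores reverse complements, coverage thresholds and sequencing errors. The function names `kmers_of` and `kmer_recovery` are hypothetical.

```python
def kmers_of(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_recovery(inferred, reads, k):
    """Fraction of k-mers in the inferred sequence that occur in the reads."""
    read_kmers = set()
    for read in reads:
        read_kmers.update(kmers_of(read, k))
    expected = kmers_of(inferred, k)
    if not expected:
        return float("nan")  # sequence shorter than k
    return sum(km in read_kmers for km in expected) / len(expected)


if __name__ == "__main__":
    print(kmer_recovery("ACGTAC", ["ACGT", "GTAC"], 3))
    print(kmer_recovery("ACGTAC", ["ACGT"], 3))
```

Computing the same fraction over a sliding window of positions, rather than over the whole sequence, would give a per-position profile similar in spirit to the panels described above.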

Supplementary Figure 3 Example of a non-MHC region with low k-mer recovery from mapping-based analysis.

k-mer recovery on chromosome 8 in a region containing multiple members of the ubiquitin-specific peptidase 17–like gene family. In several 10-kb intervals within this region, <90% of the k-mers predicted to exist from the Platypus VCF are recovered from high-coverage sequencing data on NA12878.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–3, Supplementary Tables 1–9 and Supplementary Note (PDF 3181 kb)

Supplementary Data Set: Discrepancies between SNP array and PRG genotypes in NA12878.

Compressed (zip) file with screenshots showing read mapping at the 55 sites where the Viterbi-inferred genotype from the PRG disagrees with the SNP array genotype and where the PRG specifies a gap character. A manual evaluation of these sites is also provided as an Excel file. (ZIP 792 kb)

About this article

Cite this article

Dilthey, A., Cox, C., Iqbal, Z. et al. Improved genome inference in the MHC using a population reference graph. Nat Genet 47, 682–688 (2015).
