Although much is known about human genetic variation, such information is typically ignored in assembling new genomes. Instead, reads are mapped to a single reference, which can lead to poor characterization of regions of high sequence or structural diversity. We introduce a population reference graph, which combines multiple reference sequences and catalogs of variation. The genomes of new samples are reconstructed as paths through the graph using an efficient hidden Markov model, allowing for recombination between different haplotypes and additional variants. We apply the method to the 4.5-Mb extended MHC region on human chromosome 6, combining 8 assembled haplotypes, the sequences of known classical HLA alleles and 87,640 SNP variants from the 1000 Genomes Project. Using simulations, SNP genotyping, and short-read and long-read data, we demonstrate how the method improves the accuracy of genome inference and identifies regions where the current set of reference sequences is substantially incomplete.
We thank M. Eberle and colleagues at Illumina for early access to the Moleculo data. The study was funded by grants from GlaxoSmithKline and grant 100956/Z/13/Z from the Wellcome Trust to G.M., a Nuffield Department of Medicine Fellowship to Z.I. and a Sir Henry Dale Fellowship jointly awarded by the Wellcome Trust and the Royal Society to Z.I. (102541/Z/13/Z).
C.C. and M.R.N. are employed by GlaxoSmithKline (GSK) and may own GSK stock. GSK does not sell or market any software or services related to genetic analysis or the generation of genetic data. G.M. is a founder and shareholder of Genomics, Ltd. G.M. and A.D. are partners in Peptide Groove, LLP.
Integrated supplementary information
The nucleotide PRG is a directed, acyclic graph constructed from a multiple-sequence alignment reflecting variation within the aligned sequences. A k-mer PRG is constructed from the nucleotide PRG by enumerating the possible paths of length k and their relationship. A multi-PRG is generated by combining all non-branching stretches of levels in the k-mer PRG into single levels for the multi-PRG, with edges labeled with multiple k-mers.
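The construction described in this caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes that each alignment column forms one level of the graph, that `_` marks an alignment gap, and that a k-mer is any string of k non-gap characters spelled along a path. The function names `build_nucleotide_prg` and `enumerate_kmers` are hypothetical.

```python
from collections import defaultdict

GAP = "_"  # assumed gap symbol in the multiple-sequence alignment


def build_nucleotide_prg(alignment):
    """Build a level-structured DAG from equal-length aligned sequences.

    Returns (levels, edges): levels[i] is the set of symbols seen in
    column i, and edges[(i, s)] is the set of symbols in column i + 1
    that follow symbol s in at least one aligned sequence.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("aligned sequences must have equal length")
    levels = [set() for _ in range(length)]
    edges = defaultdict(set)
    for row in alignment:
        for i, symbol in enumerate(row):
            levels[i].add(symbol)
            if i + 1 < length:
                edges[(i, symbol)].add(row[i + 1])
    return levels, edges


def enumerate_kmers(levels, edges, k):
    """Collect every k-mer spelled by some path, skipping gap symbols."""
    kmers = set()

    def walk(level, symbol, prefix):
        if symbol != GAP:
            prefix += symbol
        if len(prefix) == k:
            kmers.add(prefix)
            return
        for nxt in edges.get((level, symbol), ()):
            walk(level + 1, nxt, prefix)

    for level, symbols in enumerate(levels):
        for symbol in symbols:
            if symbol != GAP:  # a k-mer must start on a real nucleotide
                walk(level, symbol, "")
    return kmers


alignment = ["ACGT", "AC_T", "ATGT"]
levels, edges = build_nucleotide_prg(alignment)
print(sorted(enumerate_kmers(levels, edges, 3)))
```

Because edges are shared between sequences, a path may switch from one input haplotype to another at any level, so the enumerated k-mers can include recombinants that appear in none of the aligned sequences. The sketch does not cover collapsing non-branching stretches into a multi-PRG.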
Each panel shows the fraction of k-mers recovered at single-nucleotide resolution from chromotypes inferred by the four methods using the high-coverage data from NA12878. The average over the locus is also shown.
Supplementary Figure 3 Example of a non-MHC region with low k-mer recovery from mapping-based analysis.
k-mer recovery on chromosome 8 in a region containing multiple members of the ubiquitin-specific peptidase 17–like gene family where there are several 10-kb intervals where <90% of k-mers predicted to exist from the Platypus VCF are recovered from high-coverage sequencing data on NA12878.
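The k-mer recovery statistic used in these figures can be sketched as follows, under stated assumptions: recovery is taken to be the fraction of distinct k-mers predicted by the inferred sequences that also occur in the sequencing reads, reverse complements and coverage thresholds are ignored, and the helper names `kmers_of` and `kmer_recovery` are hypothetical rather than taken from the paper.

```python
def kmers_of(sequence, k):
    """Return the set of distinct k-mers in one sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def kmer_recovery(inferred_sequences, reads, k):
    """Fraction of predicted k-mers that are observed in the reads.

    Returns None when the inferred sequences yield no k-mers,
    because the fraction is undefined in that case.
    """
    predicted = set()
    for sequence in inferred_sequences:
        predicted |= kmers_of(sequence, k)
    observed = set()
    for read in reads:
        observed |= kmers_of(read, k)
    if not predicted:
        return None
    return len(predicted & observed) / len(predicted)


# Toy example: one inferred haplotype and two short reads.
print(kmer_recovery(["ACGTAC"], ["ACGTA", "TTT"], 3))
```

Applying the same function to consecutive windows of an inferred sequence would give a per-interval profile of the kind summarized in the caption above.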
Supplementary Figures 1–3, Supplementary Tables 1–9 and Supplementary Note (PDF 3181 kb)
Compressed (zip) file with screenshots showing read mapping at the 55 sites where the Viterbi-inferred genotype from the PRG disagrees with the SNP array genotype and where the PRG specifies a gap character. A manual evaluation of these sites is also provided as an Excel file. (ZIP 792 kb)
About this article
Cite this article
Dilthey, A., Cox, C., Iqbal, Z. et al. Improved genome inference in the MHC using a population reference graph. Nat Genet 47, 682–688 (2015). https://doi.org/10.1038/ng.3257