Improved genome inference in the MHC using a population reference graph

Dilthey, Alexander; Cox, Charles; Iqbal, Zamin; Nelson, Matthew R; McVean, Gil

doi:10.1038/ng.3257

Technical Report
Published: 27 April 2015


Nature Genetics volume 47, pages 682–688 (2015)


Subjects

Genome assembly algorithms

Abstract

Although much is known about human genetic variation, such information is typically ignored in assembling new genomes. Instead, reads are mapped to a single reference, which can lead to poor characterization of regions of high sequence or structural diversity. We introduce a population reference graph, which combines multiple reference sequences and catalogs of variation. The genomes of new samples are reconstructed as paths through the graph using an efficient hidden Markov model, allowing for recombination between different haplotypes and additional variants. By applying the method to the 4.5-Mb extended MHC region on human chromosome 6, combining 8 assembled haplotypes, the sequences of known classical HLA alleles and 87,640 SNP variants from the 1000 Genomes Project, we demonstrate, using simulations, SNP genotyping, and short-read and long-read data, how the method improves the accuracy of genome inference and identifies regions where the current set of reference sequences is substantially incomplete.
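The idea of reconstructing a genome as a path that can switch between reference haplotypes can be illustrated with a toy Viterbi decoder. The sketch below is not the authors' model: it assumes a haploid sample, pre-aligned reference haplotypes of equal length, and per-position allele support counts, whereas the paper works on a k-mer graph. The function name `viterbi_mosaic` and the parameters `switch_prob` and `pseudo` are hypothetical.

```python
import math

def viterbi_mosaic(haplotypes, support, switch_prob=0.01, pseudo=0.5):
    """Most likely mosaic of reference haplotypes given allele support.

    haplotypes: equal-length aligned strings (assumed input format).
    support: one dict per position mapping allele -> supporting read count.
    switch_prob: assumed per-position probability of a recombination switch.
    """
    n_hap, n_pos = len(haplotypes), len(support)
    if n_hap < 2:
        raise ValueError("need at least two haplotypes to model switching")
    stay = math.log(1.0 - switch_prob)
    switch = math.log(switch_prob / (n_hap - 1))

    def emit(h, pos):
        # Smoothed log score for the allele that haplotype h carries here.
        counts = support[pos]
        total = sum(counts.values()) + pseudo
        return math.log((counts.get(haplotypes[h][pos], 0) + pseudo) / total)

    def step(prev, cur):
        return stay if prev == cur else switch

    score = [emit(h, 0) for h in range(n_hap)]
    back = []
    for pos in range(1, n_pos):
        new_score, pointers = [], []
        for h in range(n_hap):
            best = max(range(n_hap), key=lambda p: score[p] + step(p, h))
            new_score.append(score[best] + step(best, h) + emit(h, pos))
            pointers.append(best)
        score = new_score
        back.append(pointers)

    # Trace back from the best final state to recover the copying path.
    state = max(range(n_hap), key=lambda h: score[h])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    sequence = "".join(haplotypes[h][i] for i, h in enumerate(path))
    return path, sequence

refs = ["AAAAAA", "CCCCCC"]
evidence = [{"A": 10}] * 3 + [{"C": 10}] * 3
path, sequence = viterbi_mosaic(refs, evidence)
```

With this toy evidence the decoder copies the first haplotype for three positions and then switches to the second, which is the kind of recombinant path that a single linear reference cannot represent.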


**Figure 1: Read mapping in the MHC class II region.**

**Figure 2: Schematic showing the construction and application of a PRG.**

**Figure 3: Simulation study and empirical validation.**

**Figure 4: Recovery of chromotype k-mers from HTS data.**

**Figure 5: Spatial recovery of k-mers within the HLA class II region.**

**Figure 6: Alignment of synthetic long-read data to chromotypes.**


Accession codes

Accessions

NCBI Reference Sequence


Acknowledgements

We thank M. Eberle and colleagues at Illumina for early access to the Moleculo data. The study was funded by grants from GlaxoSmithKline and grant 100956/Z/13/Z from the Wellcome Trust to G.M., a Nuffield Department of Medicine Fellowship to Z.I. and a Sir Henry Dale Fellowship jointly awarded by the Wellcome Trust and the Royal Society to Z.I. (102541/Z/13/Z).

Author information

Authors and Affiliations

Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK
Alexander Dilthey, Zamin Iqbal & Gil McVean
Department of Quantitative Sciences, GlaxoSmithKline, Stevenage, UK
Charles Cox
Department of Quantitative Sciences, GlaxoSmithKline, Research Triangle Park, North Carolina, USA
Matthew R Nelson


Contributions

G.M. designed the experiment. A.D. and C.C. performed analyses. Z.I., M.R.N. and G.M. supervised the research. A.D. and G.M. wrote the manuscript with the assistance of co-authors.

Corresponding authors

Correspondence to Alexander Dilthey or Gil McVean.

Ethics declarations

Competing interests

C.C. and M.R.N. are employed by GlaxoSmithKline (GSK) and may own GSK stock. GSK does not sell or market any software or services related to genetic analysis or the generation of genetic data. G.M. is a founder and shareholder of Genomics, Ltd. G.M. and A.D. are partners in Peptide Groove, LLP.

Integrated supplementary information

Supplementary Figure 1 Relationship between the nucleotide and k-mer PRGs.

The nucleotide PRG is a directed, acyclic graph constructed from a multiple-sequence alignment reflecting variation within the aligned sequences. A k-mer PRG is constructed from the nucleotide PRG by enumerating the possible paths of length k and their relationship. A multi-PRG is generated by combining all non-branching stretches of levels in the k-mer PRG into single levels for the multi-PRG, with edges labeled with multiple k-mers.
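The relationship between the nucleotide and k-mer graphs can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the published construction: every alignment column becomes one level whose edges are the distinct symbols in that column (so any combination of alleles across columns is a valid path), gaps are written as `-`, and the multi-PRG compression step is omitted. The function names are hypothetical, and the path enumeration is exponential in the worst case, so it only suits toy inputs.

```python
GAP = "-"

def nucleotide_prg(msa):
    """Collapse an alignment into levels: one sorted list of edge labels per column."""
    length = len(msa[0])
    if any(len(row) != length for row in msa):
        raise ValueError("all aligned sequences must have the same length")
    return [sorted({row[i] for row in msa}) for i in range(length)]

def kmers_from(levels, start, k):
    """All k-mers spelled by paths whose first non-gap edge is at level `start`."""
    results = set()
    # Seed with non-gap edges only, so each k-mer is attributed to the
    # level where it actually begins.
    stack = [(start + 1, s) for s in levels[start] if s != GAP]
    while stack:
        level, prefix = stack.pop()
        if len(prefix) == k:
            results.add(prefix)
            continue
        if level == len(levels):
            continue  # ran off the end of the graph before reaching length k
        for symbol in levels[level]:
            # Gap edges advance the level without extending the k-mer.
            extended = prefix if symbol == GAP else prefix + symbol
            stack.append((level + 1, extended))
    return results

def kmer_prg(levels, k):
    """Map each level to the set of k-mers that start there."""
    return {i: kmers_from(levels, i, k) for i in range(len(levels))}

levels = nucleotide_prg(["ACGT", "A-GT", "ATGT"])
kmers = kmer_prg(levels, 3)
```

In this example the second column carries three alternatives (`C`, `T`, or a deletion), so three distinct 3-mers start at the first level.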

Supplementary Figure 2 NA12878 k-mer recovery within classical HLA loci for four approaches.

Each panel shows the fraction of k-mers recovered at single-nucleotide resolution from chromotypes inferred by the four methods using the high-coverage data from NA12878. The average over the locus is also shown.
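A per-position recovery statistic of this kind can be approximated as follows. The sketch assumes that a k-mer counts as recovered if it occurs at least once in the reads, treats a single inferred haplotype rather than a diploid chromotype, and ignores reverse complements and sequencing errors; the paper's exact definition may differ. The function names `read_kmers` and `kmer_recovery` are hypothetical.

```python
def read_kmers(reads, k):
    """Set of all k-mers occurring in the reads (forward strand only)."""
    return {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}

def kmer_recovery(haplotype, observed, k):
    """Per-position recovery flags and their mean for one inferred sequence."""
    flags = [haplotype[i:i + k] in observed
             for i in range(len(haplotype) - k + 1)]
    mean = sum(flags) / len(flags) if flags else 0.0
    return flags, mean

observed = read_kmers(["ACGT", "TTAC"], 3)
flags, mean = kmer_recovery("ACGTAC", observed, 3)
```

Here three of the four 3-mers in the inferred sequence are present in the reads, so the locus average is 0.75; the flag list gives the single-nucleotide-resolution view.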

Supplementary Figure 3 Example of a non-MHC region with low k-mer recovery from mapping-based analysis.

k-mer recovery on chromosome 8 in a region containing multiple members of the ubiquitin-specific peptidase 17–like gene family. In several 10-kb intervals, <90% of the k-mers predicted to exist from the Platypus VCF are recovered from high-coverage sequencing data on NA12878.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–3, Supplementary Tables 1–9 and Supplementary Note (PDF 3181 kb)

Supplementary Data Set: Discrepancies between SNP array and PRG genotypes in NA12878.

Compressed (zip) file with screenshots showing read mapping at the 55 sites where the Viterbi-inferred genotype from the PRG disagrees with the SNP array genotype and where the PRG specifies a gap character. A manual evaluation of these sites is also provided as an Excel file. (ZIP 792 kb)


About this article

Cite this article

Dilthey, A., Cox, C., Iqbal, Z. et al. Improved genome inference in the MHC using a population reference graph. Nat Genet 47, 682–688 (2015). https://doi.org/10.1038/ng.3257


Received: 02 July 2014
Accepted: 03 March 2015
Published: 27 April 2015
Issue Date: June 2015
DOI: https://doi.org/10.1038/ng.3257
