Abstract
The human reference genome serves as the foundation for genomics by providing a scaffold for alignment of sequencing reads, but currently only reflects a single consensus haplotype, thus impairing analysis accuracy. Here we present a graph reference genome implementation that enables read alignment across 2,800 diploid genomes encompassing 12.6 million SNPs and 4.0 million insertions and deletions (indels). The pipeline processes one whole-genome sequencing sample in 6.5 h using a system with 36 CPU cores. We show that using a graph genome reference improves read mapping sensitivity and produces a 0.5% increase in variant calling recall, with unaffected specificity. Structural variations incorporated into a graph genome can be genotyped accurately under a unified framework. Finally, we show that iterative augmentation of graph genomes yields incremental gains in variant calling accuracy. Our implementation is an important advance toward fulfilling the promise of graph genomes to radically enhance the scalability and accuracy of genomic analyses.
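To make the idea of a graph reference concrete, the following is a minimal sketch, not the Graph Genome Pipeline itself (its source code is not public). It assumes a toy 10-base reference and two hypothetical variants, one SNP and one deletion, and shows how a read carrying both alternate alleles matches a path through the graph but not the linear reference. All names (`build_segments`, `haplotypes`, `read_matches`) and all sequences are illustrative assumptions. Enumerating every path is exponential in the number of variant sites and is done here only for clarity; practical graph aligners index the graph rather than expand it.

```python
from itertools import product


def build_segments(reference, variants):
    """Split a linear reference into invariant segments and variant sites.

    variants: list of (pos, ref_allele, alt_allele) with 0-based positions,
    assumed non-overlapping. Returns a list of tuples; each tuple holds the
    alternative sequences available at that step of the graph.
    """
    segments = []
    cursor = 0
    for pos, ref, alt in sorted(variants):
        if pos < cursor:
            raise ValueError("overlapping variants are not supported in this sketch")
        if reference[pos:pos + len(ref)] != ref:
            raise ValueError(f"REF allele {ref!r} does not match reference at {pos}")
        segments.append((reference[cursor:pos],))  # shared backbone segment
        segments.append((ref, alt))                # bubble: reference or alternate
        cursor = pos + len(ref)
    segments.append((reference[cursor:],))
    return segments


def haplotypes(segments):
    """Enumerate every sequence spelled by a path through the toy graph."""
    return {"".join(choice) for choice in product(*segments)}


def read_matches(read, sequences):
    """Exact-match 'alignment': True if the read occurs in any sequence."""
    return any(read in seq for seq in sequences)


# Hypothetical toy data (not taken from the paper).
reference = "ACGTACGTAC"
variants = [(3, "T", "G"),    # SNP: T>G at position 3
            (6, "GT", "G")]   # 1-bp deletion at position 6
segments = build_segments(reference, variants)
paths = haplotypes(segments)

read = "GGACGA"  # carries both alternate alleles
on_linear = read_matches(read, {reference})
on_graph = read_matches(read, paths)
print(len(paths), on_linear, on_graph)
```

In this toy case the read fails to match the single-haplotype reference exactly yet matches one of the four graph paths, which is the kind of mapping-sensitivity gain the abstract describes.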
Code availability
Graph Genome Pipeline is freely available to academic users for non-commercial use. Compiled standalone tools and the License of Use can be accessed at https://www.sevenbridges.com/graph-genome-academic-release/. The source code of the Graph Genome Pipeline tools is not publicly available.
Data availability
Raw sequencing data for the 150 Coriell WGS samples (Figs. 1, 4 and 5) can be accessed from the European Nucleotide Archive under accession PRJEB20654. Raw sequencing data for the Qatari samples (Fig. 5) can be found under NCBI SRA accessions SRP060765, SRP061943 and SRP061463. Genome in a Bottle data (Fig. 3) are available from the NCBI FTP site (ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data). The Sanger sequencing traces have been deposited in the European Nucleotide Archive under accession PRJEB26700.
Acknowledgements
We are grateful to the members of the GA4GH Data Workgroup, Benchmarking, and Reference variation initiatives, in particular J. Zook, for insightful discussions and ideas. M. Huvet helped refine the treatment and presentation of ideas behind trio-based benchmarking. Research reported in this publication was supported in part by the UK Department of Health grant SBRI Genomics Competition: Enabling Technologies for Genomic Sequence Data Analysis and Interpretation administered by Genomics England.
Author information
Contributions
G.R., V.S., W.-P.L., J.S., A.D., B.P., A.J., and I.S. developed the algorithms and implemented the tools for graph genome alignment. J.B. and I.J.J. developed the algorithms and implemented the tools for variant calling. K.G. implemented the simulation experiments and carried out the benchmarks based on simulated data. V.A., J.N., A.J., and G.R. devised and carried out the experiments with population-specific genome graphs. P.K. and B.C.T. developed the computational tools used for benchmarks based on related genomes, and A.J. and M.C.S. carried out the experiments. S.-G.J., G.D., L.L., and P.K. created the genome graph containing the structural variants and designed and carried out all of the related experiments. M.P. created the machine learning–based variant filters and carried out the related experiments. I.G. and M.K. aided in interpreting the results and worked on the manuscript. Y.L., G.R., and D.K. prepared the manuscript with input from all other authors. D.K. conceived and oversaw the project with assistance from A.J., A.L.S., and M.K.
Ethics declarations
Competing interests
G.R., J.S., V.A., J.N., M.C.S., G.D., L.L., B.C.T., B.P., I.S., I.G., P.K., A.L.S., Y.L., M.P., W.-P.L., M.K., and D.K. were employed by Seven Bridges Genomics Inc. during the development of the described tools. V.S., J.B., I.J.J., K.G., S.-G.J., A.D., and A.J. are current employees of Seven Bridges Genomics Inc. G.R., V.S., J.S., J.B., I.J.J., V.A., K.G., S.-G.J., L.L., I.S., P.K., A.L.S., Y.L., A.J., M.P. and D.K. hold shares, stock options or restricted stock units in Seven Bridges Genomics Inc. D.K. is co-inventor on 12 patents (issued: 14/016,833; 14/811,057; 15/196,345; 14/041,850; 14/157,759; 14/157,979; published: 14/517,406; 14/517,419; 14/517,513; 14/517,451; 14/744,536; 14/798,686). V.S. is inventor on four patents (pending: 15/061,235; 14/885,192; 15/598,404; 15/597,464). W.-P.L. is co-inventor on three patents (published: 14/994,385; pending: 15/353,105; 15/007,874). B.P., I.S. and A.J. are co-inventors on one patent (pending: 15/452,963). I.J.J. is inventor on one patent (62/630,347). The applicant for all patents is Seven Bridges Genomics Inc.
Additional information
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Text and Figures
Supplementary Note, Supplementary Tables 1–4, 6 and 17, and Supplementary Figures 1–22
Supplementary Table 5
Computational resource requirements of the Graph Genome Aligner and BWA-MEM
Supplementary Table 7
Precision FDA Truth Contest results vs. Graph Genome Pipeline
Supplementary Table 8
Variant calling benchmarking against genotyping using SNP arrays
Supplementary Table 9
Variant calling benchmarking results from simulated data
Supplementary Table 10
Genome in a Bottle benchmarking results
Supplementary Table 11
Trio benchmarking: inferred variant calling precision and recall
Supplementary Table 12
Trio benchmarking: Mendelian compliance rates with variant representation resolution
Supplementary Table 13
Trio benchmarking: Mendelian compliance rate without variant representation resolution
Supplementary Table 14
Validation of potentially false false positive variants in GiaB samples
Supplementary Table 15
Structural variation coordinates used in SV genotyping benchmarking experiments
Supplementary Table 16
Variant calling using global graph augmented by population-specific variants
About this article
Cite this article
Rakocevic, G., Semenyuk, V., Lee, WP. et al. Fast and accurate genomic analyses using genome graphs. Nat Genet 51, 354–362 (2019). https://doi.org/10.1038/s41588-018-0316-4
DOI: https://doi.org/10.1038/s41588-018-0316-4