Abstract
The rapid growth of sequencing technologies has greatly contributed to our understanding of human genetics. Yet, despite this growth, mainstream technologies have not been fully able to resolve the diploid nature of the human genome. Here we describe statistically aided, long-read haplotyping (SLRH), a rapid, accurate method that uses a statistical algorithm to take advantage of the partially phased information contained in long genomic fragments analyzed by short-read sequencing. For a human sample, as little as 30 Gbp of additional sequencing data are needed to phase genotypes identified by 50× coverage whole-genome sequencing. Using SLRH, we phase 99% of single-nucleotide variants in three human genomes into long haplotype blocks 0.2–1 Mbp in length. We apply our method to determine allele-specific methylation patterns in a human genome and identify hundreds of differentially methylated regions that were previously unknown. SLRH should facilitate population-scale haplotyping of human genomes.
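The sequencing budget quoted above can be put in perspective with a short back-of-the-envelope calculation. The sketch below assumes a haploid human genome length of about 3.1 Gbp, a figure that is not stated in the abstract, and the function names are hypothetical. Under that assumption, 30 Gbp corresponds to roughly 10× additional coverage, or about one fifth more data than the 50× whole-genome run.

```python
# Assumption (not from the abstract): haploid human genome length of ~3.1 Gbp.
GENOME_SIZE_GBP = 3.1


def yield_from_coverage(depth: float, genome_gbp: float = GENOME_SIZE_GBP) -> float:
    """Sequencing yield in Gbp needed to reach a given average depth."""
    return depth * genome_gbp


def coverage_from_yield(yield_gbp: float, genome_gbp: float = GENOME_SIZE_GBP) -> float:
    """Average depth obtained from a given sequencing yield in Gbp."""
    return yield_gbp / genome_gbp


# Figures taken from the abstract: 50x whole-genome sequencing, 30 Gbp for phasing.
wgs_yield = yield_from_coverage(50)        # data behind the 50x genotyping run
extra_depth = coverage_from_yield(30)      # depth equivalent of the phasing data
extra_fraction = 30 / wgs_yield            # phasing data relative to the 50x run

print(f"50x WGS yield: {wgs_yield:.0f} Gbp")
print(f"30 Gbp is about {extra_depth:.1f}x coverage")
print(f"Additional data relative to WGS: {extra_fraction:.0%}")
```

A different genome-size assumption shifts these numbers slightly but not the overall picture: the phasing data are a modest addition to the genotyping run.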
Acknowledgements
We thank C. Pan for assistance in coordinating contacts and discussions. This work is funded by US National Institutes of Health grants HL107393-02 and HG004558-05 and by the Genetics Department of Stanford University.
Author information
Contributions
D.P. and M.K. developed the laboratory preparation protocol. V.K. developed the Prism phasing algorithm. Z.M. and R.C. performed the Methyl-seq experiments. T.B. prepared the phasing libraries. V.K., D.X. and D.P. performed computational analysis. V.K., D.X. and M.S. wrote the manuscript. R.C. and M.K. reviewed and revised the manuscript. M.K. and M.S. supervised the research.
Ethics declarations
Competing interests
V.K., D.P., T.B. and M.K. performed the research at Moleculo Inc. (acquired by Illumina Inc.). D.P., T.B. and M.K. are employed by Illumina Inc. and V.K. is a consultant to Illumina Inc. M.K., D.P., T.B. and V.K. are listed as inventors on a patent filed for the SLRH technology. The library preparation protocol is covered by US and international patents with numbers 61/532,882 and 13/608,778 on which D.P. and M.K. are listed as inventors. The SLRH technology is offered commercially by Illumina Inc.
Supplementary information
Supplementary Text and Figures
Supplementary Figures 1–5, Supplementary Tables 1–15 and Supplementary Note (PDF 1431 kb)
List of DMRs (XLSX 106 kb)
Prism source code (ZIP 181 kb)
About this article
Cite this article
Kuleshov, V., Xie, D., Chen, R. et al. Whole-genome haplotyping using long reads and statistical methods. Nat Biotechnol 32, 261–266 (2014). https://doi.org/10.1038/nbt.2833
DOI: https://doi.org/10.1038/nbt.2833