Accurate whole-genome sequencing and haplotyping from 10 to 20 human cells

Journal name: Nature
Volume: 487
Pages: 190–195
DOI: 10.1038/nature11236

Abstract

Recent advances in whole-genome sequencing have brought the vision of personal genomics and genomic medicine closer to reality. However, current methods lack clinical accuracy and the ability to describe the context (haplotypes) in which genome variants co-occur in a cost-effective manner. Here we describe a low-cost DNA sequencing and haplotyping process, long fragment read (LFR) technology, which is similar to sequencing long single DNA molecules without cloning or separation of metaphase chromosomes. In this study, ten LFR libraries were made using only ~100 picograms of human DNA per sample. Up to 97% of the heterozygous single nucleotide variants were assembled into long haplotype contigs. Removal of false positive single nucleotide variants not phased by multiple LFR haplotypes resulted in a final genome error rate of 1 in 10 megabases. Cost-effective and accurate genome sequencing and haplotyping from 10–20 human cells, as demonstrated here, will enable comprehensive genetic studies and diverse clinical applications.

Figures

  Figure 1: The LFR technology.

    An overview of the LFR technology and controlled random enzymatic fragmenting is shown. (i) First, 100–130 pg of high molecular mass (HMM) DNA is physically separated into 384 distinct wells; (ii) through several steps, all within the same well without intervening purifications, the genomic DNA is amplified, fragmented and ligated to unique barcode adapters; (iii) all 384 wells are combined, purified and introduced into the sequencing platform of Complete Genomics (ref. 10); (iv) mate-paired reads are mapped to the genome using a custom alignment program and barcode sequences are used to group tags into haplotype contigs; and (v) the final result is a diploid genome sequence.
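
    The dilution in step (i) can be checked with a little arithmetic. The sketch below is not taken from the paper: it assumes roughly 6.6 pg of DNA per diploid human cell (a commonly quoted approximation) and a simple Poisson model in which every long fragment lands in a well independently. The function names are hypothetical.

```python
import math

PG_PER_DIPLOID_CELL = 6.6  # ASSUMPTION: approximate DNA mass of one diploid human cell (pg)
WELLS = 384                # number of wells, from the figure caption


def cell_equivalents(input_pg):
    """Number of diploid cells whose DNA adds up to input_pg."""
    return input_pg / PG_PER_DIPLOID_CELL


def haploid_genomes_per_well(input_pg, wells=WELLS):
    """Average number of haploid genome copies in one well."""
    return 2 * cell_equivalents(input_pg) / wells


def same_well_collision_probability(input_pg, wells=WELLS):
    """Chance that a well holding a fragment from one parental haplotype
    also holds a fragment from the other haplotype at the same locus
    (Poisson approximation, ASSUMING intact, independently placed fragments)."""
    expected_other_copies = cell_equivalents(input_pg) / wells
    return 1 - math.exp(-expected_other_copies)


for pg in (100, 130):
    print(pg, round(cell_equivalents(pg), 1),
          round(haploid_genomes_per_well(pg), 3),
          round(same_well_collision_probability(pg), 3))
```

    Under these assumptions, 100–130 pg corresponds to roughly 15–20 cells, and each well holds well under one haploid genome, so reads carrying the same barcode at a given locus mostly derive from a single parental chromosome.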

  Figure 2: LFR haplotyping algorithm.

    a, Variation extraction. Variations are extracted from the aliquot tagged reads. The 10-base Reed–Solomon codes enable tag recovery by error correction. M denotes the number of genomic reads in the set (approximately 8 billion); N denotes the number of the candidate heterozygous loci in the genome (~3 million). b, Heterozygous SNP pair connectivity evaluation. The matrix of shared aliquots is computed for each heterozygous SNP pair within a certain neighbourhood. Loop 1 is over all the heterozygous SNPs. Loop 2 is over all the heterozygous SNPs on the chromosome that are in the neighbourhood of the heterozygous SNPs in loop 1 (K). This neighbourhood is constrained by the expected number of heterozygous SNPs and the expected fragment lengths. c, Graph generation. An undirected graph is made, with nodes corresponding to the heterozygous SNPs and the connections corresponding to the orientation and the strength of the best hypothesis for the relationship between those SNPs. The orientation is binary and is shown in the figure with a colour. Red and green depict a flipped and unflipped relationship between heterozygous SNP pairs, respectively. The strength is defined by using fuzzy logic operations on the elements of the shared aliquot matrix. d, Graph optimization. The graph is optimized by a minimum spanning tree operation. e, Contig generation. Each sub-tree is reduced to a contig by keeping the first heterozygous SNP unchanged, and flipping or not flipping the other heterozygous SNPs on the sub-tree, based on their paths to the first heterozygous SNP. The designation of parent 1 (P1) and parent 2 (P2) to each contig is arbitrary. The gaps in the chromosome-wide tree define the boundaries for different sub-trees/contigs on that chromosome. f, Optional mapping of LFR contigs to parental chromosomes. Using parental information, a ‘mother’ or ‘father’ label is placed on the P1 and P2 haplotypes of each contig.
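
    Panels b to e can be illustrated with a toy sketch. It is not the authors' implementation and simplifies in three labelled ways: edge strength is a plain difference of shared-aliquot counts rather than a fuzzy logic score, the neighbourhood is a fixed count of following SNPs rather than one derived from expected fragment lengths, and the tree is built with Kruskal's algorithm taking the strongest edges first (equivalent to a minimum spanning tree on negated strengths). All names are hypothetical.

```python
def shared_aliquot_matrix(a, b):
    """2x2 counts of wells shared by allele x of SNP a and allele y of SNP b."""
    return [[len(a[x] & b[y]) for y in (0, 1)] for x in (0, 1)]


def build_edges(snps, window):
    """snps: list of (position, {0: wells, 1: wells}) sorted by position.
    Returns (strength, i, j, flip) for SNP pairs within the window."""
    edges = []
    for i in range(len(snps)):
        for j in range(i + 1, min(i + 1 + window, len(snps))):
            m = shared_aliquot_matrix(snps[i][1], snps[j][1])
            unflipped = m[0][0] + m[1][1]
            flipped = m[0][1] + m[1][0]
            strength = abs(unflipped - flipped)  # SIMPLIFICATION of the fuzzy score
            if strength > 0:
                edges.append((strength, i, j, int(flipped > unflipped)))
    return edges


def phase(snps, window=2):
    """Return contigs as lists of (position, allele carried by haplotype P1)."""
    parent = list(range(len(snps)))
    parity = [0] * len(snps)  # flip state of each node relative to its parent

    def find(x):
        """Root of x and flip state of x relative to that root."""
        if parent[x] == x:
            return x, 0
        root, p = find(parent[x])
        parity[x] ^= p
        parent[x] = root
        return root, parity[x]

    # Kruskal: strongest edges first, skip edges that would close a cycle.
    for strength, i, j, flip in sorted(build_edges(snps, window), key=lambda e: -e[0]):
        (ri, pi), (rj, pj) = find(i), find(j)
        if ri != rj:
            parent[rj] = ri
            parity[rj] = pi ^ pj ^ flip

    # Each connected sub-tree becomes a contig, oriented by its first SNP.
    groups = {}
    for k in range(len(snps)):
        root, p = find(k)
        groups.setdefault(root, []).append((k, p))
    contigs = []
    for members in groups.values():
        first_parity = members[0][1]
        contigs.append([(snps[k][0], p ^ first_parity) for k, p in members])
    return contigs


example = [
    (100, {0: {1}, 1: {2}}),
    (200, {0: {2, 3}, 1: {1}}),
    (300, {0: {1}, 1: {3}}),
    (900, {0: {7}, 1: {8}}),
]
print(phase(example))
```

    In this toy input the first three SNPs share wells and form one contig, with the second SNP flipped relative to the first, while the SNP at position 900 shares no wells with the others and forms its own contig. As in the caption, the P1/P2 designation within each contig is arbitrary.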

References

  1. Human genome: Genomes by the thousand. Nature 467, 1026–1027 (2010)
  2. Wheeler, D. A. et al. The complete genome of an individual by massively parallel DNA sequencing. Nature 452, 872–876 (2008)
  3. Wang, J. et al. The diploid genome sequence of an Asian individual. Nature 456, 60–65 (2008)
  4. Ley, T. J. et al. DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome. Nature 456, 66–72 (2008)
  5. Bentley, D. R. et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53–59 (2008)
  6. Ahn, S. M. et al. The first Korean genome sequence and analysis: full genome sequencing for a socio-ethnic group. Genome Res. 19, 1622–1629 (2009)
  7. Kim, J. I. et al. A highly annotated whole-genome sequence of a Korean individual. Nature 460, 1011–1015 (2009)
  8. McKernan, K. J. et al. Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding. Genome Res. 19, 1527–1541 (2009)
  9. Pushkarev, D., Neff, N. F. & Quake, S. R. Single-molecule sequencing of an individual human genome. Nature Biotechnol. 27, 847–850 (2009)
  10. Drmanac, R. et al. Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays. Science 327, 78–81 (2010)
  11. Kitzman, J. O. et al. Haplotype-resolved genome sequencing of a Gujarati Indian individual. Nature Biotechnol. 29, 59–63 (2011)
  12. Rothberg, J. M. et al. An integrated semiconductor device enabling non-optical genome sequencing. Nature 475, 348–352 (2011)
  13. Suk, E. K. et al. A comprehensively molecular haplotype-resolved genome of a European individual. Genome Res. 21, 1672–1685 (2011)
  14. Venter, J. C. et al. The sequence of the human genome. Science 291, 1304–1351 (2001)
  15. Lander, E. S. et al. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001)
  16. Tewhey, R., Bansal, V., Torkamani, A., Topol, E. J. & Schork, N. J. The importance of phase information for human genomics. Nature Rev. Genet. 12, 215–223 (2011)
  17. Browning, S. R. & Browning, B. L. Haplotype phasing: existing methods and new developments. Nature Rev. Genet. 12, 703–714 (2011)
  18. Roach, J. C. et al. Chromosomal haplotypes by genetic phasing of human families. Am. J. Hum. Genet. 89, 382–397 (2011)
  19. Levy, S. et al. The diploid genome sequence of an individual human. PLoS Biol. 5, e254 (2007)
  20. Duitama, J. et al. Fosmid-based whole genome haplotyping of a HapMap trio child: evaluation of Single Individual Haplotyping techniques. Nucleic Acids Res. 40, 2041–2053 (2012)
  21. Zhang, K. et al. Long-range polony haplotyping of individual human chromosome molecules. Nature Genet. 38, 382–387 (2006)
  22. Ma, L. et al. Direct determination of molecular haplotypes by chromosome microdissection. Nature Methods 7, 299–301 (2010)
  23. Fan, H. C., Wang, J., Potanina, A. & Quake, S. R. Whole-genome molecular haplotyping of single cells. Nature Biotechnol. 29, 51–57 (2011)
  24. Yang, H., Chen, X. & Wong, W. H. Completely phased genome sequencing through chromosome sorting. Proc. Natl Acad. Sci. USA 108, 12–17 (2011)
  25. Drmanac, R. Nucleic acid analysis by random mixtures of non-overlapping fragments. US patent 7,901,891 (2006)
  26. Dean, F. B. et al. Comprehensive human genome amplification using multiple displacement amplification. Proc. Natl Acad. Sci. USA 99, 5261–5266 (2002)
  27. Kermani, B. G. & Shannon, K. W. Method and apparatus for quantification of DNA sequencing quality and construction of a characterizable model system using Reed–Solomon codes. US patent PCT/US2010/023083 (2010)
  28. The International HapMap Consortium. A haplotype map of the human genome. Nature 437, 1299–1320 (2005)
  29. Frazer, K. A. et al. A second generation human haplotype map of over 3.1 million SNPs. Nature 449, 851–861 (2007)
  30. The 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature 467, 1061–1073 (2010)
  31. Carnevali, P. et al. Computational techniques for human genome resequencing using mated gapped reads. J. Comput. Biol. 19, 279–292 (2011)
  32. Conrad, D. F. et al. Variation in genome-wide mutation rates within and between human families. Nature Genet. 43, 712–714 (2011)
  33. Adzhubei, I. A. et al. A method and server for predicting damaging missense mutations. Nature Methods 7, 248–249 (2010)
  34. MacArthur, D. G. et al. A systematic survey of loss-of-function variants in human protein-coding genes. Science 335, 823–828 (2012)
  35. Lohmueller, K. E. et al. Proportionally more deleterious genetic variation in European than in African populations. Nature 451, 994–997 (2008)
  36. Sandelin, A., Alkema, W., Engstrom, P., Wasserman, W. W. & Lenhard, B. JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res. 32, D91–D94 (2004)
  37. Bryne, J. C. et al. JASPAR, the open access database of transcription factor-binding profiles: new content and tools in the 2008 update. Nucleic Acids Res. 36, D102–D106 (2008)

Author information

  1. These authors contributed equally to this work.

    • Brock A. Peters &
    • Bahram G. Kermani

Affiliations

  1. Complete Genomics, Inc., 2071 Stierlin Court, Mountain View, California 94043, USA

    • Brock A. Peters,
    • Bahram G. Kermani,
    • Andrew B. Sparks,
    • Oleg Alferov,
    • Peter Hong,
    • Andrei Alexeev,
    • Yuan Jiang,
    • Fredrik Dahl,
    • Y. Tom Tang,
    • Juergen Haas,
    • Joseph E. Peterson,
    • Helena Perazich,
    • George Yeung,
    • Jia Liu,
    • Linsu Chen,
    • Michael I. Kennemer,
    • Kaliprasad Pothuraju,
    • Karel Konvicka,
    • Mike Tsoupko-Sitnikov,
    • Krishna P. Pant,
    • Jessica C. Ebert,
    • Geoffrey B. Nilsen,
    • Jonathan Baccash,
    • Aaron L. Halpern &
    • Radoje Drmanac
  2. Department of Genetics, Harvard Medical School, Cambridge, Massachusetts 02115, USA

    • Kimberly Robasky,
    • Alexander Wait Zaranek,
    • Je-Hyuk Lee,
    • Madeleine Price Ball &
    • George M. Church
  3. Program in Bioinformatics, Boston University, Boston, Massachusetts 02215, USA

    • Kimberly Robasky
  4. Wyss Institute for Biologically Inspired Engineering, Harvard Medical School, Cambridge, Massachusetts 02115, USA

    • Je-Hyuk Lee
  5. Present addresses: Aria Diagnostics, 5945 Optical Court, San Jose, California 95138, USA (A.B.S.); Halo Genomics, Dag Hammarskjolds vag 54A, 751 83 Uppsala, Sweden (F.D.).

    • Andrew B. Sparks &
    • Fredrik Dahl

Contributions

B.A.P., B.G.K., A.B.S. and R.D. conceived the study. B.A.P., B.G.K., R.D., O.A., Y.T.T., J.H., J.C.E., J.B., A.L.H. and G.B.N. performed analyses. B.A.P., A.B.S., P.H., A.A., Y.J., F.D., J.E.P., H.P., G.Y., J.L. and L.C. developed the laboratory processes and generated the LFR libraries. K.K., M.T.-S. and K.P.P. developed the basecaller and parts of the analysis pipeline. M.I.K. formatted, managed and uploaded data to the public archives. K.R., A.W.Z., J.-H.L., M.P.B. and G.M.C. generated and analysed the RNA sequencing data. B.A.P., B.G.K. and R.D. coordinated the study and wrote the paper. All authors contributed to revision and review of the manuscript.

Competing financial interests

Employees of Complete Genomics have stock options in the company; Complete Genomics has filed several patents on this work.

Corresponding authors

Correspondence to:

Tagged read data have been deposited with the NCBI short-read archive under accession number SRP012316. All sequence data and haplotype information for LFR libraries generated in this study are also available at http://www.completegenomics.com/LFR.

Supplementary information

PDF files

  1. Supplementary Information (2.6M)

    This file contains Supplementary Figures 1–12, Supplementary Material with additional references, Supplementary Methods with additional Figures 1–14 and Supplementary Tables 1–13.
