Accurate whole-genome sequencing and haplotyping from 10 to 20 human cells

Peters, Brock A.; Kermani, Bahram G.; Sparks, Andrew B.; Alferov, Oleg; Hong, Peter; Alexeev, Andrei; Jiang, Yuan; Dahl, Fredrik; Tang, Y. Tom; Haas, Juergen; Robasky, Kimberly; Zaranek, Alexander Wait; Lee, Je-Hyuk; Ball, Madeleine Price; Peterson, Joseph E.; Perazich, Helena; Yeung, George; Liu, Jia; Chen, Linsu; Kennemer, Michael I.; Pothuraju, Kaliprasad; Konvicka, Karel; Tsoupko-Sitnikov, Mike; Pant, Krishna P.; Ebert, Jessica C.; Nilsen, Geoffrey B.; Baccash, Jonathan; Halpern, Aaron L.; Church, George M.; Drmanac, Radoje

doi:10.1038/nature11236


Article
Open access
Published: 11 July 2012

Nature volume 487, pages 190–195 (2012)

Abstract

Recent advances in whole-genome sequencing have brought the vision of personal genomics and genomic medicine closer to reality. However, current methods lack clinical accuracy and the ability to describe the context (haplotypes) in which genome variants co-occur in a cost-effective manner. Here we describe a low-cost DNA sequencing and haplotyping process, long fragment read (LFR) technology, which is similar to sequencing long single DNA molecules without cloning or separation of metaphase chromosomes. In this study, ten LFR libraries were made using only ∼100 picograms of human DNA per sample. Up to 97% of the heterozygous single nucleotide variants were assembled into long haplotype contigs. Removal of false positive single nucleotide variants not phased by multiple LFR haplotypes resulted in a final genome error rate of 1 in 10 megabases. Cost-effective and accurate genome sequencing and haplotyping from 10–20 human cells, as demonstrated here, will enable comprehensive genetic studies and diverse clinical applications.

Main

The extraordinary advancements made in DNA sequencing technologies over the past few years have led to the elucidation of ∼10,000 (refs 1–13) individual human genomes (30× or greater base coverage) from different ethnicities, using different technologies (refs 2–13), and at a fraction of the cost¹⁰ of sequencing the original human reference genome¹⁴,¹⁵. Although this is a monumental achievement, the vast majority of these genomes have excluded a very important element of human genetics. Individual human genomes are diploid in nature, with half of the homologous chromosomes being derived from each parent. The context in which variations occur on each individual chromosome can have profound effects on the expression and regulation of genes and other transcribed regions of the genome¹⁶. Furthermore, determining whether two potentially detrimental mutations occur within one or both alleles of a gene is of paramount clinical importance.

Almost all recent human genome sequencing has been performed on short read length (<200 base pairs (bp)), highly parallelized systems starting with hundreds of nanograms of DNA. These technologies are excellent at generating large volumes of data quickly and economically. Unfortunately, short reads, often paired with small mate-gap sizes (500 bases–10 kilobases (kb)), eliminate most single nucleotide polymorphism (SNP) phase information beyond a few kilobases⁸. Population-based genotype data has been used to successfully assemble short-read data into long haplotype blocks³, but these methods suffer from higher error rates and have difficulty phasing rare variants¹⁷. Although using pedigree information¹⁸ or combining it with population data provides further phasing power, no combination of these methods is able to resolve de novo mutations¹⁷.

At present, four personal genomes have been sequenced and assembled as diploid: those of J. Craig Venter¹⁹, a Gujarati Indian (HapMap sample NA20847)¹¹, and two Europeans (Max Planck One¹³ and HapMap sample NA12878 (ref. 20)). All have involved cloning long DNA fragments in a process similar to that used for the construction of the human reference genome¹⁴,¹⁵. These processes generate long phased contigs, with N50 values (50% of the covered bases are found within contigs longer than this number) of 350 kb¹⁹, 386 kb¹¹ and 1 megabase (Mb)¹³, and full-chromosome haplotypes in combination with parental genotypes²⁰. However, they require a large amount of initial DNA and extensive library processing, and are currently too expensive¹¹ to use in a routine clinical environment. Furthermore, several reports have recently demonstrated whole-chromosome haplotyping through direct isolation of metaphase chromosomes²¹,²²,²³,²⁴. These methods have yet to be used for whole-genome sequencing and require preparation and isolation of whole metaphase chromosomes, which can be challenging for some clinical samples. Here we introduce long fragment read (LFR) technology, a process that enables genome sequencing and haplotyping at a clinically relevant cost, quality and scale.
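The N50 statistic used throughout this paper can be computed directly from a list of contig lengths. The following is a minimal sketch under the definition given above; the function name `n50` and the example lengths are hypothetical and are not taken from any of the cited assemblies.

```python
def n50(contig_lengths):
    """Return the N50 of a set of contigs.

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the running total first reaches half of all covered
    bases, so at least 50% of bases lie in contigs of that length or longer.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig")
    half_of_total = sum(lengths) / 2
    running_total = 0
    for length in lengths:
        running_total += length
        if running_total >= half_of_total:
            return length


# Hypothetical contig lengths in kb, for illustration only.
example_kb = [1000, 500, 400, 300, 200, 100]
print(n50(example_kb))
```

Because the statistic is weighted by bases rather than by contig count, a few very long contigs can dominate it, which is why N50 is reported alongside the fraction of SNPs phased.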

LFR technology

The LFR approach can generate long-range phased variants because it is conceptually similar to single-molecule sequencing of fragments 10–1,000 kb²⁵ in length. This is achieved by the stochastic separation of corresponding long parental DNA fragments into physically distinct pools followed by subsequent fragmentation to generate shorter sequencing templates (Fig. 1). The same principles are used in aliquoting fosmid clones¹¹,¹³. As the fraction of the genome in each pool decreases to less than a haploid genome, the statistical likelihood of having a corresponding fragment from both parental chromosomes in the same pool markedly diminishes²⁵. For example, 0.1 genome equivalents (300 Mb) per well yields an approximately 10% chance that two fragments will overlap, and a 50% chance that those fragments will be derived from separate parental chromosomes. The end result is a roughly 5% overall chance that a particular well will be uninformative for a given fragment. Likewise, the more individual pools that are interrogated, the greater the number of times a fragment from the maternal and paternal homologues will be analysed in separate pools. The current version of LFR uses a 384-well plate with 10–20% of a haploid genome in each well, yielding a theoretical 19–38× physical coverage of both the maternal and paternal alleles of each fragment (see Supplementary Materials and Supplementary Table 1 for an explanation of how this amount of material was selected). This high initial DNA redundancy of 19–38×, versus recently described strategies using fosmid pools of 3× (ref. 11) or 6× (ref. 13), ensures complete genome coverage and higher variant calling and phasing accuracy.
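The arithmetic behind these figures can be reproduced in a few lines. This is a rough sketch, not the model in the Supplementary Materials: it assumes that the chance of a second fragment overlapping a given fragment is approximately equal to the fraction of a haploid genome in the well, and the function names are hypothetical.

```python
import math


def uninformative_fraction(haploid_fraction_per_well, p_opposite_parent=0.5):
    """Approximate chance that a well is uninformative for a given fragment.

    Assumption: the overlap probability is taken to equal the fraction of a
    haploid genome placed in the well (0.1 gives about 10%).
    """
    p_overlap = haploid_fraction_per_well
    return p_overlap * p_opposite_parent


def coverage_per_allele(n_wells, haploid_fraction_per_well):
    """Theoretical physical coverage of each parental allele.

    The plate holds n_wells * fraction haploid genome equivalents in total;
    in a diploid sample half of these come from each parental chromosome.
    """
    return n_wells * haploid_fraction_per_well / 2


print(f"uninformative wells: {uninformative_fraction(0.1):.0%}")
low = coverage_per_allele(384, 0.10)
high = coverage_per_allele(384, 0.20)
print(f"physical coverage per allele: {low:.1f}x to {high:.1f}x")
```

Under these assumptions the 384-well plate gives 19.2× to 38.4× per allele, which matches the 19–38× range quoted in the text.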

To prepare LFR libraries in a high-throughput manner we developed an automated process that performs all LFR-specific steps in the same 384-well plate. First, a highly uniform amplification using a modified, Φ29 polymerase-based, multiple displacement amplification (MDA)²⁶ is performed to replicate each fragment about 10,000 times. Next, through a process of five enzymatic steps within each well, without intervening purification steps, DNA is fragmented and ligated with barcode adapters. In brief, long DNA molecules are processed to blunt-ended 300–1,500-bp fragments through the new process of controlled random enzymatic fragmenting (Supplementary Methods and Supplementary Figs 2 and 3). Unique 10-base Reed–Solomon²⁷ error-correcting barcode adapters (Supplementary Fig. 4) are then ligated to fragmented DNA in each well using a high yield, low chimaera formation protocol¹⁰. Lastly, all 384 wells are combined and an unsaturated PCR using primers common to the ligated adapters is used to generate sufficient template for massively parallel short-read sequencing platforms (see Supplementary Methods). The addition of the LFR pre-processing steps to the standard library process adds at present about US$100 to the reagent cost of our genome sequencing (Supplementary Table 2).

LFR libraries from 10 cells or 100 pg of isolated DNA

As a demonstration of the power of LFR to determine an accurate diploid genome sequence, we generated three libraries of Yoruban female HapMap sample NA19240, six libraries from European HapMap pedigree 1463 (Supplementary Fig. 5), and a single library from Personal Genome Project sample NA20431. Pedigree 1463 and NA19240 have been extensively studied in the HapMap Project²⁸,²⁹, the 1,000 Genomes Project³⁰ and our own efforts (http://www.completegenomics.com/sequence-data/download-data/). As a result, highly accurate haplotype information can be generated for these samples based on the redundant sequence data from familial samples. One NA19240 LFR library was made from 10 cells of the corresponding immortalized B-cell line; all other libraries were made from an estimated 100–130 pg (equivalent to 15–20 cells) of denatured high molecular mass genomic DNA (Supplementary Fig. 6 and Supplementary Methods). Libraries were analysed using the sequencing platform of Complete Genomics¹⁰. Thirty-five-base mate-paired reads were mapped to the reference genome using a custom alignment algorithm¹⁰,³¹, yielding on average more than 250 gigabases (Gb) of mapped data with an average genomic coverage of >80× (Table 1 and Supplementary Table 3). Analysis of the mapped LFR data shows two distinct characteristics attributable to MDA: slight underrepresentation of GC-rich sequences (Supplementary Fig. 7) and an increase in chimaeric sequences (Supplementary Table 3). In addition, coverage normalized across 100-kb windows was more variable (Supplementary Fig. 8). Nevertheless, almost all genomic regions were covered with sufficient reads (five or more), demonstrating that 10,000-fold MDA amplification by our optimized protocol can be used for comprehensive genome sequencing.
Barcodes were used to group mapped reads based on their physical well location within each library, resulting in sparse regions of coverage interspersed between long spans with almost no read coverage (Supplementary Fig. 9). Each of these discrete regions of coverage represents a physical DNA fragment. On average, each well contained 10–20% of a haploid genome (300–600 Mb), in fragments ranging from 10 kb to more than 300 kb in length with N50s of ∼60 kb (Table 1). Initial fragment coverage was very uniform between chromosomes (Supplementary Fig. 10). As estimated from all detected fragments, the total amounts of DNA used to make the two NA19240 libraries from extracted DNA were ∼62 pg and 84 pg (equivalent to 9.4 and 12.7 cells, respectively). This is less than the expected 100–130 pg, indicating some lost or undetected DNA or imprecision in DNA quantification. Notably, the 10-cell library seemed to be made from ∼90 pg (13.6 cells) of DNA, probably due to some of the cells being in S phase during isolation (Table 1).
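The grouping step described above can be illustrated with a simple sketch. It is not the Complete Genomics pipeline: the input layout `(well, chromosome, position)`, the function name `infer_fragments` and the `max_gap` threshold are all assumptions chosen for illustration. Reads from one well are sorted along each chromosome, and a new fragment is started whenever the distance to the previous read exceeds `max_gap`.

```python
from collections import defaultdict


def infer_fragments(reads, max_gap=10_000):
    """Group mapped reads by well and merge nearby reads into fragments.

    reads: iterable of (well, chrom, position) tuples (hypothetical layout).
    max_gap: largest distance in bases between neighbouring reads that are
             still assigned to the same fragment (an assumed threshold).
    Returns {well: [(chrom, start, end), ...]}.
    """
    positions_by_well_chrom = defaultdict(list)
    for well, chrom, position in reads:
        positions_by_well_chrom[(well, chrom)].append(position)

    fragments = defaultdict(list)
    for (well, chrom), positions in sorted(positions_by_well_chrom.items()):
        positions.sort()
        start = previous = positions[0]
        for position in positions[1:]:
            if position - previous > max_gap:
                # Long span without reads: close the current fragment.
                fragments[well].append((chrom, start, previous))
                start = position
            previous = position
        fragments[well].append((chrom, start, previous))
    return dict(fragments)


# Toy reads: well 1 holds two separate fragments on chr1, well 2 holds one.
toy_reads = [
    (1, "chr1", 100), (1, "chr1", 5_000), (1, "chr1", 9_000),
    (1, "chr1", 500_000), (1, "chr1", 504_000),
    (2, "chr1", 200),
]
print(infer_fragments(toy_reads))
```

Summing the inferred fragment lengths over all wells gives an estimate of the total DNA that went into a library, which is how input amounts such as ∼62 pg can be derived from the sequence data.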

Table 1 Comparison of haplotyping performance between different genome assemblies

LFR haplotyping results

To ensure complete representation of the genome we maximized the input of DNA fragments for a given read coverage and number of aliquots (Supplementary Materials and Supplementary Table 1). Unlike other experimental approaches¹¹,¹³,²⁰, this resulted in low-coverage read data (<2×) for each fragment in each of the ∼40 wells in which a fragment is found. This type of data is not useful for defining haplotypes for each initial fragment and required the development of a new phasing algorithm that statistically combines reads from related fragments found in separate aliquots (Supplementary Methods and Fig. 2). Application of our algorithm to the LFR libraries resulted in the placement of on average 92% of the phasable heterozygous SNPs into long contigs with N50s of ∼1 Mb and ∼500 kb for the NA19240 and European samples, respectively (Table 1 and Supplementary Table 4). The large reduction in the N50 contig size for European samples can be explained by many more regions of low heterozygosity (RLHs) found in these genomes (Supplementary Tables 5–7, Supplementary Fig. 11 and Supplementary Materials). Doubling the number of reads to ∼160× coverage or combining replicate samples (a total of 768 independent wells), each with ∼80× coverage, pushed the phasing rate to ∼96% (Table 1). Using only the SNP loci called in the LFR library for phasing resulted mostly in a reduction in the total number of phased SNPs by 5–15% (Table 1 and Supplementary Materials). Importantly, the 1.72 million heterozygous SNPs called and phased by the NA12892 LFR library alone were slightly more than the number of SNPs phased for a comparable sample using a fosmid approach¹³,²⁰ (Table 1). For NA19240, the 10-cell library phased more than 98% of the variants phased by the two libraries made from isolated DNA, demonstrating that LFR can be successful starting from a small number of cells.
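The idea of combining sparse evidence across aliquots can be illustrated for a single pair of heterozygous SNPs. The sketch below is a deliberately simplified stand-in and is not the algorithm of Fig. 2 or the Supplementary Methods: it assumes that each well reports at most one allele per SNP, and the function name and data layout are hypothetical. Wells in which both SNPs are observed vote for the two variant alleles lying on the same chromosome ("cis") or on opposite chromosomes ("trans").

```python
from collections import Counter


def pairwise_phase(wells_snp_a, wells_snp_b):
    """Vote on the phase of two heterozygous SNPs using shared wells.

    wells_snp_a, wells_snp_b: dict mapping well id -> allele observed in that
    well (0 = reference, 1 = variant). Only wells covering both SNPs are used.
    Returns (phase, votes), where phase is "cis", "trans" or None on a tie.
    """
    shared_wells = wells_snp_a.keys() & wells_snp_b.keys()
    votes = Counter()
    for well in shared_wells:
        same_allele = wells_snp_a[well] == wells_snp_b[well]
        votes["cis" if same_allele else "trans"] += 1
    if votes["cis"] == votes["trans"]:
        return None, votes
    phase = "cis" if votes["cis"] > votes["trans"] else "trans"
    return phase, votes


# Toy observations: wells 1, 2 and 3 cover both SNPs.
snp_a = {1: 0, 2: 1, 3: 0, 7: 1}
snp_b = {1: 0, 2: 1, 3: 1, 9: 0}
print(pairwise_phase(snp_a, snp_b))
```

In this toy case two wells support the cis arrangement and one supports trans, so the pair is called cis. A full method would additionally have to weight such votes and chain pairwise links into contigs.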

Figure 2: LFR haplotyping algorithm.

LFR reproducibility and phasing error rate analysis

To test LFR reproducibility we compared haplotype data between the two NA19240 replicate libraries. In general, the libraries were very concordant, with only 64 differences per library in ∼2.2 million heterozygous SNPs phased by both libraries (Supplementary Table 8) or 1 of this error type in 44 Mb. LFR was also highly accurate when compared with the conservative but accurate whole-chromosome phasing generated from the parental genomes NA19238 and NA19239 previously sequenced by multiple methods (refs 28, 29 and http://www.completegenomics.com/sequence-data/download-data/; Supplementary Table 4). Only ∼60 instances in 1.57 million comparable individual loci were found in which LFR phased a variant inconsistent with that of the parental haplotyping (false phasing rate of 0.002% if half of discordances are due to sequencing errors in parental genomes). The LFR data also contained ∼135 contigs per library (2.2%), with one or more flipped haplotype blocks (Supplementary Table 8). Extending these analyses to the European replicate libraries of sample NA12877 (Supplementary Table 8) and comparing them with a recent high quality family-based analysis¹⁸ yielded similar results assuming each method contributes half of the observed discordance (Supplementary Table 9). In both NA19240 and NA12877 libraries several contigs had dozens of flipped segments. Most of these contigs were located in RLHs, low read coverage regions, or repetitive regions observed in an unexpectedly large number of wells (for example, subtelomeric or centromeric regions). Most of these errors can be corrected by forcing the LFR phasing algorithms to end contigs in these regions. Alternatively, these errors can be removed with the simple, low cost addition of standard high density array genotype data (∼1 million or greater SNPs) from at least one parent to the LFR assembly. We found that parental genotypes can connect 98% of LFR-phased heterozygous SNPs in full chromosome haplotypes. 
Furthermore, these data allow haplotypes to be assigned to maternal and paternal lineages, information that is crucial for incorporating parental imprinting in genetic diagnoses in any experimental haplotyping approach. If parental data are unavailable, population genotype data could also be used to connect many of these LFR contigs, although at the cost of increased phasing errors¹⁷.
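The rates quoted in this section follow from simple ratios, reproduced here as a short worked example. The counts are taken from the text; the helper name `rate_percent` is hypothetical, and the halving step encodes the stated assumption that half of the discordances come from errors in the parental genomes.

```python
def rate_percent(events, total):
    """Return events / total expressed as a percentage."""
    return 100.0 * events / total


# Replicate concordance: 64 differences among ~2.2 million shared phased SNPs.
snps_per_difference = 2_200_000 / 64

# Comparison with parental phasing: ~60 discordant loci in 1.57 million.
raw_discordance = rate_percent(60, 1_570_000)
# Assumption stated in the text: half of the discordances are attributed
# to sequencing errors in the parental genomes rather than to LFR.
lfr_false_phasing = raw_discordance / 2

print(f"one replicate difference per {snps_per_difference:,.0f} phased SNPs")
print(f"raw discordance: {raw_discordance:.4f}%")
print(f"false phasing rate attributed to LFR: {lfr_false_phasing:.3f}%")
```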

Phasing de novo mutations

As a demonstration of the completeness and accuracy of our diploid genome sequencing we assessed phasing of 35 de novo mutations recently reported in the genome of NA19240 (ref. 32). Thirty-four of these mutations were called in either the standard genome or one of the LFR libraries. Of those, 32 de novo mutations were phased (16 coming from each parent) in at least one of the two replicate LFR libraries (Supplementary Table 10). Not surprisingly, the two non-phased variants reside in RLHs. Of these 32 variants, 21 were phased previously³² and 18 were consistent with LFR phasing results (M. Hurles, personal communication). The three discordances are probably due to errors in the previous study (M. Hurles, personal communication) confirming LFR accuracy, but not affecting the substantive conclusions of the report.

Error reduction for accurate sequencing from 10 cells

Substantial error rates (∼1 single nucleotide variant (SNV) in 100–1,000 called kilobases) are a common attribute of all current massively parallelized sequencing technologies (refs 2–10, 12). These rates are probably too high for diagnostic use and complicate many studies searching for new mutations. The vast majority of errors are no more likely to occur on the maternal than on the paternal chromosome. This lack of consistent phasing, or presence in only a few aliquots, can be exploited by LFR to eliminate these errors from the final assembled haplotypes. To demonstrate this we defined a set of heterozygous SNPs in the NA19240 and NA12877 LFR libraries that were reported with high confidence in each of the individual’s parents as matching the human reference genome at both alleles. About 44,000 of these heterozygous SNPs in NA19240 and 30,000 in NA12877 met this criterion (85% sensitivity). By virtue of their nonexistence in the parental genomes these variations are de novo mutations, cell-line-specific somatic mutations, or false positive variants. Approximately 1,000–1,500 of these variants were reproducibly phased in each of the two replicate LFR libraries from samples NA19240 and NA12877 (Supplementary Table 11). These numbers are similar to those previously reported for de novo and cell-line-specific mutations in NA19240 (ref. 32). The remaining variants are likely to be initial false positives of which only about 500 are phased per library. This represents a 60-fold reduction of the false positive rate in those variations that are phased. Only ∼2,400 of these false variants are present in the standard libraries, of which only ∼260 are phased (<1 false positive SNV in 20 Mb; 5,700 haploid megabases per 260 errors). Each LFR library exhibits a 15-fold increase, compared with a genome sequenced by the standard process, in library-specific false positive calls before phasing.
Most of these false positive SNVs are likely to have been introduced by MDA; sampling of rare cell-line variants may be responsible for a smaller percentage. Despite making LFR libraries from 100 pg of DNA and introducing a large number of errors through MDA, applying the LFR phasing algorithm described above reduces the overall sequencing error rate to about 1 in 10 Mb (99.99999% accuracy; ∼600 false heterozygous SNVs per 5.7 Gb), approximately 10-fold lower than the previously published error rates using the same ligation-based sequencing chemistry¹⁸. These accurate haplotypes allow detection of highly diverged human sequences (Supplementary Materials and Supplementary Table 13) and many other applications.
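The filtering principle can be sketched as a rule on where a candidate variant was observed. This is an illustrative simplification, not the published procedure: the function name, the haplotype labels and the `min_wells` threshold are assumptions. A candidate is kept only if its variant allele appears in several wells and every one of those wells assigns it to the same parental haplotype. The same block also converts the figure of ∼600 false heterozygous SNVs per 5.7 Gb into a per-base rate.

```python
def is_consistently_phased(well_haplotypes, min_wells=2):
    """Decide whether a candidate heterozygous SNV should be kept.

    well_haplotypes: one haplotype label per well in which the variant allele
    was seen, for example ["mat", "mat", "mat"] (labels are hypothetical).
    min_wells: assumed minimum number of independent wells required.
    A random amplification or sequencing error is unlikely to recur in
    several wells on the same haplotype, so such candidates are retained.
    """
    if len(well_haplotypes) < min_wells:
        return False
    return len(set(well_haplotypes)) == 1


print(is_consistently_phased(["mat", "mat", "mat"]))  # seen repeatedly, one haplotype
print(is_consistently_phased(["mat"]))                # seen in a single well only
print(is_consistently_phased(["mat", "pat"]))         # inconsistent haplotypes

# Residual error rate implied by the counts given in the text.
false_snvs = 600
haploid_bases = 5.7e9
bases_per_false_call = haploid_bases / false_snvs
accuracy_percent = 100 * (1 - false_snvs / haploid_bases)
print(f"one false call per {bases_per_false_call / 1e6:.1f} Mb")
print(f"accuracy: {accuracy_percent:.5f}%")
```

The counts give roughly one false call per 9.5 Mb, in line with the 1 in 10 Mb figure stated in the abstract.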

Many genes have inactivating variations in both alleles

To demonstrate how LFR could be used in a diagnostic/prognostic environment we analysed the coding SNP data of all libraries for two or more nonsense, splice site or PolyPhen2 (ref. 33) predicted detrimental missense variations that co-occur in the same gene. Of these, approximately 40 genes were found in each individual that contained at least one detrimental variation in each allele (Table 2). Extending this analysis to variants that disrupt transcription factor-binding sites (TFBS) introduces a further ∼100 genes per individual (additional analyses of the effects of TFBS disruption on allele-specific expression can be found in Supplementary Materials and Supplementary Table 12). Owing to the high accuracy of LFR it is unlikely that these variants are a result of sequencing errors, and many could have been introduced in the propagation of these cell lines. Furthermore, some of these variants are likely to have little to no effect on the function of these gene products³⁴ and much more work is required to understand how changes in TFBS affect transcription. A few of these variants were found in unrelated individuals, suggesting that they could be improperly annotated or the result of a systematic mapping or reference error. The genome of NA19240 contained a further ∼10 genes predicted to have complete loss of function; this is most likely due to biases introduced by using a European reference genome to annotate an African genome. Nonetheless, these numbers are similar to those found in several recent studies on individual genomes¹³,³⁴,³⁵, and suggest that most generally healthy individuals probably have a small number of genes, not absolutely required for normal life, which encode ineffective protein products. Further studies are required to understand the meaning of these types of change. Importantly, we have demonstrated that LFR is able to identify genes in which two detrimental variants are found in different alleles without the need for costly verification³⁴.
This information is crucial for effective clinical interpretation of patient genomes.
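Once variants are phased, finding genes with a detrimental variant on each allele reduces to a grouping step. The sketch below assumes a hypothetical input of `(gene, contig_id, haplotype)` tuples for variants already classified as detrimental; the gene names are placeholders. Following the Methods Summary, a gene is considered only when all of its detrimental variants fall within a single haplotype contig, because phase is not defined across contigs.

```python
from collections import defaultdict


def genes_hit_on_both_alleles(detrimental_variants):
    """Return genes carrying a detrimental variant on each haplotype.

    detrimental_variants: iterable of (gene, contig_id, haplotype) tuples,
    where haplotype is 0 or 1 within its phased contig (hypothetical layout).
    """
    calls_by_gene = defaultdict(list)
    for gene, contig_id, haplotype in detrimental_variants:
        calls_by_gene[gene].append((contig_id, haplotype))

    hits = []
    for gene, calls in calls_by_gene.items():
        contigs = {contig_id for contig_id, _ in calls}
        haplotypes = {haplotype for _, haplotype in calls}
        # Require one contig (comparable phase) and both haplotypes affected.
        if len(contigs) == 1 and haplotypes == {0, 1}:
            hits.append(gene)
    return sorted(hits)


# Placeholder data: only GENE_A has a hit on each haplotype of one contig.
toy_variants = [
    ("GENE_A", "contig1", 0), ("GENE_A", "contig1", 1),
    ("GENE_B", "contig2", 0), ("GENE_B", "contig2", 0),
    ("GENE_C", "contig3", 0), ("GENE_C", "contig4", 1),
]
print(genes_hit_on_both_alleles(toy_variants))
```

Here GENE_B is excluded because both variants sit on the same haplotype, leaving one intact copy, and GENE_C is excluded because its variants lie in different contigs.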

Table 2 Number of genes with multiple detrimental variations.

Discussion

In this study we have demonstrated the efficiency of LFR to accurately phase up to 97% of all detected heterozygous SNPs in a genome into long contiguous stretches of DNA (N50s 400–1,500 kb in length). Even LFR libraries phased without candidate heterozygous SNPs from standard libraries, and thus using only 10–20 human cells, are able to phase 91–97% of the available SNPs. In several instances, the LFR libraries used in this paper had less than optimal starting input DNA (NA20431, Table 1). Phasing rate improvements seen by combining two replicate libraries or starting with more DNA (NA12892, Table 1) agree with this conclusion. Furthermore, underrepresentation of GC-rich sequences resulted in less of the genome being called (Supplementary Table 3). Improvements to the MDA process, removal of amplification steps as future single-molecule sequencing processes improve, or modifications to how we perform base and variant calling in LFR libraries will help to increase the coverage in these regions (see Supplementary Materials and Supplementary Fig. 12 for a demonstration of how LFR can make calls in low coverage regions). Moreover, as the cost of whole-genome sequencing continues to fall, higher coverage libraries, demonstrated in this paper to markedly improve call rates and phasing, will become more affordable.

A consensus haploid sequence is sufficient for many applications; however, it lacks two very important pieces of data for detecting disease causing variants in personal genomes: phased heterozygous variants and the identification of false positive and negative variant calls. By providing sequence data from both the maternal and paternal chromosomes independently, LFR is able to detect regions in the genome assembly in which only one allele has been covered. Likewise, false positive calls are avoided because LFR independently, in separate aliquots, sequences both the maternal and paternal chromosomes 10–40 times. The result is a statistically low probability that random sequencing or DNA amplification errors would repeatedly occur in several aliquots at the same base position on one parental allele. Thus, LFR allows for the first time, to our knowledge, both accurate and cost-effective sequencing of a genome from a few human cells in spite of the required extensive DNA amplification. Furthermore, by phasing SNPs over hundreds of kilobases (or over entire chromosomes by integrating LFR with routine genotyping of at least one parent), LFR is able to more accurately predict the effects of compound regulatory variants and parental imprinting on allele-specific gene expression and function in various tissue types. Additionally, separation of mate-pair reads by haplotype may also help to detect expanded trinucleotide repeats in diseases such as Huntington’s disease, even though LFR does not provide direct length measure of these or similar repeats. Taken together, this provides a highly accurate report about the potential genomic changes that could cause gain or loss of protein function. This kind of information, obtained inexpensively for every patient, will be crucial for clinical use of genomic data. 
Moreover, successful and affordable diploid sequencing of a human genome starting from ten cells opens the possibility for comprehensive and accurate genetic screening of micro-biopsies from diverse tissue sources such as circulating tumour cells or pre-implantation embryos generated through in vitro fertilization.

Methods Summary

High molecular mass DNA was purified from cell lines GM12877, GM12878, GM12885, GM12886, GM12891, GM12892, GM19240 and GM20431 (Coriell Institute for Medical Research) using a RecoverEase DNA isolation kit (Agilent) following the manufacturer’s protocol. Individual cells of NA19240 were isolated under ×200 magnification with a micromanipulator (Eppendorf) and deposited into a 1.5-ml microtube with 10 μl of distilled H₂O. LFR libraries were made as outlined in the text; a more detailed description can be found in the Supplementary Methods. LFR libraries were sequenced, mapped and assembled using the sequencing pipeline of Complete Genomics. Phasing was performed using custom haplotyping algorithms as described in Fig. 2 and in further detail in the Supplementary Methods. Variations adversely affecting protein function or expression were found using several methods. Missense variations were analysed using PolyPhen2 (ref. 33). For this study both ‘possibly damaging’ and ‘probably damaging’ were considered to be detrimental to protein function, as were all nonsense mutations. Variations determined to adversely affect messenger RNA splicing were found with a custom algorithm based on consensus splice position models from Steve Mount’s database (http://www.life.umd.edu/labs/mount/RNAinfo). JASPAR models³⁶,³⁷ were used to extract potential TFBSs from the reference genome with mast (http://meme.sdsc.edu/meme/mast-intro.html). Variations falling within these regions were compared with the models to determine what effect they had on transcription factor binding. Genes found to have two or more detrimental mutations were further analysed only if all mutations were found within the same haplotype contig. More detailed descriptions of all methods used in this paper can be found in the Supplementary Methods.

Accession codes

Primary accessions

Sequence Read Archive

SRP012316

Data deposits

Tagged read data have been deposited with the NCBI short-read archive under accession number SRP012316. All sequence data and haplotype information for LFR libraries generated in this study are also available at http://www.completegenomics.com/LFR.

References

Human. genome: Genomes by the thousand. Nature 467, 1026–1027 (2010)
Wheeler, D. A. et al. The complete genome of an individual by massively parallel DNA sequencing. Nature 452, 872–876 (2008)
Article ADS CAS PubMed Google Scholar
Wang, J. et al. The diploid genome sequence of an Asian individual. Nature 456, 60–65 (2008)
Article ADS CAS PubMed PubMed Central Google Scholar
Ley, T. J. et al. DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome. Nature 456, 66–72 (2008)
Article ADS CAS PubMed PubMed Central Google Scholar
Bentley, D. R. et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53–59 (2008)
Article ADS CAS PubMed PubMed Central Google Scholar
Ahn, S. M. et al. The first Korean genome sequence and analysis: full genome sequencing for a socio-ethnic group. Genome Res. 19, 1622–1629 (2009)
Article CAS PubMed PubMed Central Google Scholar
Kim, J. I. et al. A highly annotated whole-genome sequence of a Korean individual. Nature 460, 1011–1015 (2009)
Article ADS CAS PubMed PubMed Central Google Scholar
McKernan, K. J. et al. Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding. Genome Res. 19, 1527–1541 (2009)
Article CAS PubMed PubMed Central Google Scholar
Pushkarev, D., Neff, N. F. & Quake, S. R. Single-molecule sequencing of an individual human genome. Nature Biotechnol. 27, 847–850 (2009)
Article CAS Google Scholar
Drmanac, R. et al. Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays. Science 327, 78–81 (2010)
Article ADS CAS PubMed Google Scholar
Kitzman, J. O. et al. Haplotype-resolved genome sequencing of a Gujarati Indian individual. Nature Biotechnol. 29, 59–63 (2011)
Article CAS Google Scholar
Rothberg, J. M. et al. An integrated semiconductor device enabling non-optical genome sequencing. Nature 475, 348–352 (2011)
Article CAS PubMed Google Scholar
Suk, E. K. et al. A comprehensively molecular haplotype-resolved genome of a European individual. Genome Res. 21, 1672–1685 (2011)
Article CAS PubMed PubMed Central Google Scholar
Venter, J. C. et al. The sequence of the human genome. Science 291, 1304–1351 (2001)
Article ADS CAS PubMed Google Scholar
Lander, E. S. et al. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001)
Article ADS CAS PubMed Google Scholar
Tewhey, R., Bansal, V., Torkamani, A., Topol, E. J. & Schork, N. J. The importance of phase information for human genomics. Nature Rev. Genet. 12, 215–223 (2011)
Article CAS PubMed Google Scholar
Browning, S. R. & Browning, B. L. Haplotype phasing: existing methods and new developments. Nature Rev. Genet. 12, 703–714 (2011)
Article CAS PubMed Google Scholar
Roach, J. C. et al. Chromosomal haplotypes by genetic phasing of human families. Am. J. Hum. Genet. 89, 382–397 (2011)
Article CAS PubMed PubMed Central Google Scholar
Levy, S. et al. The diploid genome sequence of an individual human. PLoS Biol. 5, e254 (2007)
Article PubMed PubMed Central Google Scholar
Duitama, J. et al. Fosmid-based whole genome haplotyping of a HapMap trio child: evaluation of Single Individual Haplotyping techniques. Nucleic Acids Res. 40, 2041–2053 (2012)
Article CAS PubMed Google Scholar
Zhang, K. et al. Long-range polony haplotyping of individual human chromosome molecules. Nature Genet. 38, 382–387 (2006)
Article CAS PubMed Google Scholar
Ma, L. et al. Direct determination of molecular haplotypes by chromosome microdissection. Nature Methods 7, 299–301 (2010)
Article CAS PubMed PubMed Central Google Scholar
Fan, H. C., Wang, J., Potanina, A. & Quake, S. R. Whole-genome molecular haplotyping of single cells. Nature Biotechnol. 29, 51–57 (2011)
Article CAS Google Scholar
Yang, H., Chen, X. & Wong, W. H. Completely phased genome sequencing through chromosome sorting. Proc. Natl Acad. Sci. USA 108, 12–17 (2011)
Article ADS CAS PubMed Google Scholar
Drmanac, R. Nucleic acid analysis by random mixtures of non-overlapping fragments. US patent 7,901. 891 (2006)
Dean, F. B. et al. Comprehensive human genome amplification using multiple displacement amplification. Proc. Natl Acad. Sci. USA 99, 5261–5266 (2002)
Kermani, B. G. & Shannon, K. W. Method and apparatus for quantification of DNA sequencing quality and construction of a characterizable model system using Reed–Solomon codes. US patent PCT/US2010/023083 (2010)
The International HapMap Consortium. A haplotype map of the human genome. Nature 437, 1299–1320 (2005)
Frazer, K. A. et al. A second generation human haplotype map of over 3.1 million SNPs. Nature 449, 851–861 (2007)
The 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature 467, 1061–1073 (2010)
Carnevali, P. et al. Computational techniques for human genome resequencing using mated gapped reads. J. Comput. Biol. 19, 279–292 (2011)
Conrad, D. F. et al. Variation in genome-wide mutation rates within and between human families. Nature Genet. 43, 712–714 (2011)
Adzhubei, I. A. et al. A method and server for predicting damaging missense mutations. Nature Methods 7, 248–249 (2010)
MacArthur, D. G. et al. A systematic survey of loss-of-function variants in human protein-coding genes. Science 335, 823–828 (2012)
Lohmueller, K. E. et al. Proportionally more deleterious genetic variation in European than in African populations. Nature 451, 994–997 (2008)
Sandelin, A., Alkema, W., Engstrom, P., Wasserman, W. W. & Lenhard, B. JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res. 32, D91–D94 (2004)
Bryne, J. C. et al. JASPAR, the open access database of transcription factor-binding profiles: new content and tools in the 2008 update. Nucleic Acids Res. 36, D102–D106 (2008)

Acknowledgements

We would like to acknowledge the continuing contributions and support of all Complete Genomics employees, in particular M. McElwain, D. Bailey, D. Kruse and J. Turcotte for their help with preparing the manuscript. We also wish to thank W. Chao for his help with Figures 1 and 2. Some of this work was supported by the US Department of Commerce, National Institute of Standards and Technology, Advanced Technology Program, Cooperative Agreement Number 70NANB7H7027 and National Institutes of Health grant P50HG005550. We would like to thank J. Chen for managing the NIST grant.

Author information

Andrew B. Sparks & Fredrik Dahl
Present addresses: Aria Diagnostics, 5945 Optical Court, San Jose, California 95138, USA (A.B.S.); Halo Genomics, Dag Hammarskjolds vag 54A, 751 83 Uppsala, Sweden (F.D.).
Brock A. Peters and Bahram G. Kermani: These authors contributed equally to this work.

Authors and Affiliations

Complete Genomics, Inc., 2071 Stierlin Court, Mountain View, California 94043, USA
Brock A. Peters, Bahram G. Kermani, Andrew B. Sparks, Oleg Alferov, Peter Hong, Andrei Alexeev, Yuan Jiang, Fredrik Dahl, Y. Tom Tang, Juergen Haas, Joseph E. Peterson, Helena Perazich, George Yeung, Jia Liu, Linsu Chen, Michael I. Kennemer, Kaliprasad Pothuraju, Karel Konvicka, Mike Tsoupko-Sitnikov, Krishna P. Pant, Jessica C. Ebert, Geoffrey B. Nilsen, Jonathan Baccash, Aaron L. Halpern & Radoje Drmanac
Department of Genetics, Harvard Medical School, Cambridge, Massachusetts 02115, USA
Kimberly Robasky, Alexander Wait Zaranek, Je-Hyuk Lee, Madeleine Price Ball & George M. Church
Program in Bioinformatics, Boston University, Boston, Massachusetts 02215, USA
Kimberly Robasky
Wyss Institute for Biologically Inspired Engineering, Harvard Medical School, Cambridge, Massachusetts 02115, USA
Je-Hyuk Lee

Contributions

B.A.P., B.G.K., A.B.S. and R.D. conceived the study. B.A.P., B.G.K., R.D., O.A., Y.T.T., J.H., J.C.E., J.B., A.L.H. and G.B.N. performed analyses. B.A.P., A.B.S., P.H., A.A., Y.J., F.D., J.E.P., H.P., G.Y., J.L. and L.C. developed the laboratory processes and generated the LFR libraries. K.K., M.T.-S. and K.P.P. developed the basecaller and parts of the analysis pipeline. M.I.K. formatted, managed and uploaded data to the public archives. K.R., A.W.Z., J.-H.L., M.P.B. and G.M.C. generated and analysed the RNA sequencing data. B.A.P., B.G.K. and R.D. coordinated the study and wrote the paper. All authors contributed to revision and review of the manuscript.

Corresponding authors

Correspondence to Brock A. Peters or Radoje Drmanac.

Ethics declarations

Competing interests

Employees of Complete Genomics have stock options in the company; Complete Genomics has filed several patents on this work.

Supplementary information

Supplementary Information

This file contains Supplementary Figures 1-12, Supplementary Material with additional references, Supplementary Methods with additional Figures 1-14 and Supplementary Tables 1-13. (PDF 2677 kb)

Rights and permissions

This article is distributed under the terms of the Creative Commons Attribution-Non-Commercial-Share Alike licence (http://creativecommons.org/licenses/by-nc-sa/3.0/), which permits distribution and reproduction in any medium, provided the original author and source are credited. This licence does not permit commercial exploitation, and derivative works must be licensed under the same or similar licence.

About this article

Cite this article

Peters, B., Kermani, B., Sparks, A. et al. Accurate whole-genome sequencing and haplotyping from 10 to 20 human cells. Nature 487, 190–195 (2012). https://doi.org/10.1038/nature11236

Received: 24 January 2012
Accepted: 15 May 2012
Published: 11 July 2012
Issue Date: 12 July 2012
DOI: https://doi.org/10.1038/nature11236
