Recent advances in whole-genome sequencing have brought the vision of personal genomics and genomic medicine closer to reality. However, current methods lack clinical accuracy and the ability to describe the context (haplotypes) in which genome variants co-occur in a cost-effective manner. Here we describe a low-cost DNA sequencing and haplotyping process, long fragment read (LFR) technology, which is similar to sequencing long single DNA molecules without cloning or separation of metaphase chromosomes. In this study, ten LFR libraries were made using only ∼100 picograms of human DNA per sample. Up to 97% of the heterozygous single nucleotide variants were assembled into long haplotype contigs. Removal of false positive single nucleotide variants not phased by multiple LFR haplotypes resulted in a final genome error rate of 1 in 10 megabases. Cost-effective and accurate genome sequencing and haplotyping from 10–20 human cells, as demonstrated here, will enable comprehensive genetic studies and diverse clinical applications.
The extraordinary advancements made in DNA sequencing technologies over the past few years have led to the elucidation of ∼10,000 (refs 1–13) individual human genomes (30× or greater base coverage) from different ethnicities and using different technologies2,3,4,5,6,7,8,9,10,11,12,13 and at a fraction of the cost10 of sequencing the original human reference genome14,15. Although this is a monumental achievement, the vast majority of these genomes have excluded a very important element of human genetics. Individual human genomes are diploid in nature, with half of the homologous chromosomes being derived from each parent. The context in which variations occur on each individual chromosome can have profound effects on the expression and regulation of genes and other transcribed regions of the genome16. Furthermore, determining whether two potentially detrimental mutations occur within one or both alleles of a gene is of paramount clinical importance.
Almost all recent human genome sequencing has been performed on short read length (<200 base pairs (bp)), highly parallelized systems starting with hundreds of nanograms of DNA. These technologies are excellent at generating large volumes of data quickly and economically. Unfortunately, short reads, often paired with small mate-gap sizes (500 bases–10 kilobases (kb)), eliminate most single nucleotide polymorphism (SNP) phase information beyond a few kilobases8. Population-based genotype data has been used to successfully assemble short-read data into long haplotype blocks3, but these methods suffer from higher error rates and have difficulty phasing rare variants17. Although using pedigree information18 or combining it with population data provides further phasing power, no combination of these methods is able to resolve de novo mutations17.
At present, four personal genomes—J. Craig Venter19, a Gujarati Indian (HapMap sample NA20847)11, and two Europeans (Max Planck One13 and HapMap Sample NA12878 (ref. 20))—have been sequenced and assembled as diploid. All have involved cloning long DNA fragments in a process similar to that used for the construction of the human reference genome14,15. Although these processes generate long-phased contigs (N50 values (50% of the covered bases are found within contigs longer than this number) of 350 kb19, 386 kb11 and 1 megabase (Mb)13, and full-chromosome haplotypes in combination with parental genotypes20) they require a large amount of initial DNA, extensive library processing, and are currently too expensive11 to use in a routine clinical environment. Furthermore, several reports have recently demonstrated whole chromosome haplotyping through direct isolation of metaphase chromosomes21,22,23,24. These methods have yet to be used for whole-genome sequencing and require preparation and isolation of whole metaphase chromosomes, which can be challenging for some clinical samples. Here we introduce long fragment read (LFR) technology, a process that enables genome sequencing and haplotyping at a clinically relevant cost, quality and scale.
The LFR approach can generate long-range phased variants because it is conceptually similar to single-molecule sequencing of fragments 10–1,000 kb25 in length. This is achieved by the stochastic separation of corresponding long parental DNA fragments into physically distinct pools followed by subsequent fragmentation to generate shorter sequencing templates (Fig. 1). The same principles are used in aliquoting fosmid clones11,13. As the fraction of the genome in each pool decreases to less than a haploid genome, the statistical likelihood of having a corresponding fragment from both parental chromosomes in the same pool markedly diminishes25. For example, 0.1 genome equivalents (300 Mb) per well yields an approximately 10% chance that two fragments will overlap, and a 50% chance that those fragments will be derived from separate parental chromosomes. The end result is a roughly 5% overall chance that a particular well will be uninformative for a given fragment. Likewise, the more individual pools interrogated the greater the number of times a fragment from the maternal and paternal homologues will be analysed in separate pools. The current version of LFR uses a 384-well plate with 10–20% of a haploid genome in each well, yielding a theoretical 19–38× physical coverage of both the maternal and paternal alleles of each fragment (see Supplementary Materials and Supplementary Table 1 for an explanation of how this amount of material was selected). This high initial DNA redundancy of 19–38× versus recently described strategies using fosmid pools of 3× (ref. 11) or 6× (ref. 13) ensures complete genome coverage and higher variant calling and phasing accuracy.
To prepare LFR libraries in a high-throughput manner we developed an automated process that performs all LFR-specific steps in the same 384-well plate. First, a highly uniform amplification using a modified, Φ29 polymerase-based, multiple displacement amplification (MDA)26 is performed to replicate each fragment about 10,000 times. Next, through a process of five enzymatic steps within each well, without intervening purification steps, DNA is fragmented and ligated with barcode adapters. In brief, long DNA molecules are processed to blunt-ended 300–1,500-bp fragments through the new process of controlled random enzymatic fragmenting (Supplementary Methods and Supplementary Figs 2 and 3). Unique 10-base Reed–Solomon27 error-correcting barcode adapters (Supplementary Fig. 4) are then ligated to fragmented DNA in each well using a high yield, low chimaera formation protocol10. Lastly, all 384 wells are combined and an unsaturated PCR using primers common to the ligated adapters is used to generate sufficient template for massively parallel short-read sequencing platforms (see Supplementary Methods). The addition of the LFR pre-processing steps to the standard library process adds at present about US$100 to the reagent cost of our genome sequencing (Supplementary Table 2).
LFR libraries from 10 cells or 100 pg of isolated DNA
As a demonstration of the power of LFR to determine an accurate diploid genome sequence, we generated three libraries of Yoruban female HapMap sample NA19240, six libraries from European HapMap pedigree 1463 (Supplementary Fig. 5), and a single library from Personal Genome Project sample NA20431. Pedigree 1463 and NA19240 have been extensively studied in the HapMap Project28,29, the 1,000 Genomes Project30 and our own efforts (http://www.completegenomics.com/sequence-data/download-data/). As a result, highly accurate haplotype information can be generated for these samples based on the redundant sequence data from familial samples. One NA19240 LFR library was made from 10 cells of the corresponding immortalized B-cell line, all other libraries were made from an estimated 100–130 pg (equivalent to 15–20 cells) of denatured high molecular mass genomic DNA (Supplementary Fig. 6 and Supplementary Methods). Libraries were analysed using the sequencing platform of Complete Genomics10. Thirty-five-base mate-paired reads were mapped to the reference genome using a custom alignment algorithm10,31, yielding on average more than 250 gigabases (Gb) of mapped data with an average genomic coverage of >80× (Table 1 and Supplementary Table 3). Analysis of the mapped LFR data shows two distinct characteristics attributable to MDA: slight underrepresentation of GC-rich sequences (Supplementary Fig. 7) and an increase in chimaeric sequences (Supplementary Table 3). In addition, coverage normalized across 100-kb windows was more variable (Supplementary Fig. 8). Nevertheless, almost all genomic regions were covered with sufficient reads (five or more) demonstrating that 10,000-fold MDA amplification by our optimized protocol can be used for comprehensive genome sequencing. Barcodes were used to group mapped reads based on their physical well location within each library, resulting in sparse regions of coverage interspersed between long spans with almost no read coverage (Supplementary Fig. 9). Each of these discrete regions of coverage represents a physical DNA fragment. On average, each well contained 10–20% of a haploid genome (300–600 Mb), in fragments ranging from 10 kb to more than 300 kb in length with N50s of ∼60 kb (Table 1). Initial fragment coverage was very uniform between chromosomes (Supplementary Fig. 10). As estimated from all detected fragments, the total amounts of DNA used to make the two NA19240 libraries from extracted DNA were ∼62 pg and 84 pg (equivalent to 9.4 and 12.7 cells, respectively). This is less than the expected 100–130 pg, indicating some lost or undetected DNA or imprecision in DNA quantification. Notably, the 10-cell library seemed to be made from ∼90 pg (13.6 cells) of DNA, probably due to some of the cells being in S phase during isolation (Table 1).
LFR haplotyping results
To ensure complete representation of the genome we maximized the input of DNA fragments for a given read coverage and number of aliquots (Supplementary Materials and Supplementary Table 1). Unlike other experimental approaches11,13,20, this resulted in low-coverage read data (<2×) for each fragment in each of the ∼40 wells a fragment is found in. This type of data is not useful for defining haplotypes for each initial fragment and required the development of a new phasing algorithm that statistically combines reads from related fragments found in separate aliquots (Supplementary Methods and Fig. 2). Application of our algorithm to the LFR libraries resulted in the placement of on average 92% of the phasable heterozygous SNPs into long contigs with N50s of ∼1 Mb and ∼500 kb for the NA19240 and European samples, respectively (Table1 and Supplementary Table 4). The large reduction in the N50 contig size for European samples can be explained by many more regions of low heterozygosity (RLHs) found in these genomes (Supplementary Tables 5–7, Supplementary Fig. 11 and Supplementary Materials). Doubling the number of reads to ∼160× coverage or combining replicate samples (a total of 768 independent wells), each with ∼80× coverage, pushed the phasing rate to ∼96% (Table 1). Using only the SNP loci called in the LFR library for phasing resulted mostly in a reduction in the total number of phased SNPs by 5–15% (Table 1 and Supplementary Materials). Importantly, the 1.72 million heterozygous SNPs called and phased by the NA12892 LFR library alone was slightly higher than the number of SNPs phased for a comparable sample using a fosmid approach13,20 (Table 1). For NA19240, the 10-cell library phased more than 98% of the variants phased by the two libraries made from isolated DNA, demonstrating that LFR can be successful starting from a small number of cells.
LFR reproducibility and phasing error rate analysis
To test LFR reproducibility we compared haplotype data between the two NA19240 replicate libraries. In general, the libraries were very concordant, with only 64 differences per library in ∼2.2 million heterozygous SNPs phased by both libraries (Supplementary Table 8) or 1 of this error type in 44 Mb. LFR was also highly accurate when compared with the conservative but accurate whole-chromosome phasing generated from the parental genomes NA19238 and NA19239 previously sequenced by multiple methods (refs 28, 29 and http://www.completegenomics.com/sequence-data/download-data/; Supplementary Table 4). Only ∼60 instances in 1.57 million comparable individual loci were found in which LFR phased a variant inconsistent with that of the parental haplotyping (false phasing rate of 0.002% if half of discordances are due to sequencing errors in parental genomes). The LFR data also contained ∼135 contigs per library (2.2%), with one or more flipped haplotype blocks (Supplementary Table 8). Extending these analyses to the European replicate libraries of sample NA12877 (Supplementary Table 8) and comparing them with a recent high quality family-based analysis18 yielded similar results assuming each method contributes half of the observed discordance (Supplementary Table 9). In both NA19240 and NA12877 libraries several contigs had dozens of flipped segments. Most of these contigs were located in RLHs, low read coverage regions, or repetitive regions observed in an unexpectedly large number of wells (for example, subtelomeric or centromeric regions). Most of these errors can be corrected by forcing the LFR phasing algorithms to end contigs in these regions. Alternatively, these errors can be removed with the simple, low cost addition of standard high density array genotype data (∼1 million or greater SNPs) from at least one parent to the LFR assembly. We found that parental genotypes can connect 98% of LFR-phased heterozygous SNPs in full chromosome haplotypes. Furthermore, this data allows haplotypes to be assigned to maternal and paternal lineages; information that is crucial for incorporating parental imprinting in genetic diagnoses in any experimental haplotyping approach. If parental data are unavailable, population genotype data could also be used to connect many of these LFR contigs, although at the cost of increased phasing errors17.
Phasing de novo mutations
As a demonstration of the completeness and accuracy of our diploid genome sequencing we assessed phasing of 35 de novo mutations recently reported in the genome of NA19240 (ref. 32). Thirty-four of these mutations were called in either the standard genome or one of the LFR libraries. Of those, 32 de novo mutations were phased (16 coming from each parent) in at least one of the two replicate LFR libraries (Supplementary Table 10). Not surprisingly, the two non-phased variants reside in RLHs. Of these 32 variants, 21 were phased previously32 and 18 were consistent with LFR phasing results (M. Hurles, personal communication). The three discordances are probably due to errors in the previous study (M. Hurles, personal communication) confirming LFR accuracy, but not affecting the substantive conclusions of the report.
Error reduction for accurate sequencing from 10 cells
Substantial error rates (∼1 single nucleotide variants (SNV) in 100–1,000 called kilobases) are a common attribute of all current massively parallelized sequencing technologies2,3,4,5,6,7,8,9,10,12. These rates are probably too high for diagnostic use and complicate many studies searching for new mutations. The vast majority of errors are no more likely to occur on the maternal or paternal chromosome. This lack of consistent phasing or presence in only a few aliquots can be exploited by LFR to eliminate these errors from the final assembled haplotypes. To demonstrate this we defined a set of heterozygous SNPs in the NA19240 and NA12877 LFR libraries that were reported with high confidence in each of the individual’s parents as matching the human reference genome at both alleles. There were about 44,000 of these heterozygous SNPs in NA19240 and 30,000 in NA12877 that met this criterion (85% sensitivity). By virtue of their nonexistence in the parental genomes these variations are de novo mutations, cell-line-specific somatic mutations, or false positive variants. Approximately 1,000—1,500 of these variants were reproducibly phased in each of the two replicate LFR libraries from samples NA19240 and NA12877 (Supplementary Table 11). These numbers are similar to those previously reported for de novo and cell-line-specific mutations in NA19240 (ref. 32). The remaining variants are likely to be initial false positives of which only about 500 are phased per library. This represents a 60-fold reduction of the false positive rate in those variations that are phased. Only ∼2,400 of these false variants are present in the standard libraries, of which only ∼260 are phased (<1 false positive SNV in 20 Mb; 5,700 haploid megabases per 260 errors). Each LFR library exhibits a 15-fold increase, compared with a genome sequenced by the standard process, in library-specific false positive calls before phasing. Most of these false positive SNVs are likely to have been introduced by MDA; sampling of rare cell-line variants may be responsible for a smaller percentage. Despite making LFR libraries from 100 pg of DNA and introducing a large number of errors through MDA amplification, applying the LFR phasing algorithm described above reduces the overall sequencing error rate to 99.99999% (∼600 false heterozygous SNVs per 5.7 Gb), approximately 10-fold lower than the previous published error rates using the same ligation-based sequencing chemistry18. These accurate haplotypes allow detection of highly diverged human sequences (Supplementary Materials and Supplementary Table 13) and many other applications.
Many genes have inactivating variations in both alleles
To demonstrate how LFR could be used in a diagnostic/prognostic environment we analysed the coding SNP data of all libraries for two or more nonsense, splice site or PolyPhen2 (ref. 33) predicted detrimental missense variations that co-occur in the same gene. Of these, approximately 40 genes were found in each individual that contained at least one detrimental variation in each allele (Table 2). Extending this analysis to variants that disrupt transcription factor-binding sites (TFBS) introduces a further ∼100 genes per individual (additional analyses of the effects of TFBS disruption on allele-specific expression can be found in Supplementary Materials and Supplementary Table 12). Owing to the high accuracy of LFR it is unlikely that these variants are a result of sequencing errors and many could have been introduced in the propagation of these cell lines. Furthermore, some of these variants are likely to have little to no effect on the function of these gene products34 and much more work is required to understand how changes in TFBS affect transcription. A few of these variants were found in unrelated individuals, suggesting that they could be improperly annotated or the result of a systematic mapping or reference error. The genome of NA19240 contained a further ∼10 genes predicted to have complete loss of function; this is most likely due to biases introduced by using a European reference genome to annotate an African genome. Nonetheless, these numbers are similar to those found in several recent studies on individual genomes13,34,35, and suggest that most generally healthy individuals probably have a small number of genes, not absolutely required for normal life, which encode ineffective protein products. Further studies are required to understand the meaning of these types of change. Importantly, we have demonstrated that LFR is able to identify genes in which two detrimental variants are found in different alleles without the need for costly verification34. This information is crucial for effective clinical interpretation of patient genomes.
In this study we have demonstrated the efficiency of LFR to accurately phase up to 97% of all detected heterozygous SNPs in a genome into long contiguous stretches of DNA (N50s 400–1,500 kb in length). Even LFR libraries phased without candidate heterozygous SNPs from standard libraries, and thus using only 10–20 human cells, are able to phase 91–97% of the available SNPs. In several instances, the LFR libraries used in this paper had less than optimal starting input DNA (NA20431, Table1). Phasing rate improvements seen by combining two replicate libraries or starting with more DNA (NA12892, Table 1) agree with this conclusion. Furthermore, underrepresentation of GC-rich sequences resulted in less of the genome being called (Supplementary Table 3). Improvements to the MDA process, removal of amplification steps as future single molecule sequencing processes improve, or modifications to how we perform base and variant calling in LFR libraries will help to increase the coverage in these regions (see Supplementary Materials and Supplementary Fig. 12 for a demonstration of how LFR can make calls in low coverage regions). Moreover, as the cost of whole-genome sequencing continues to fall, higher coverage libraries, demonstrated in this paper to markedly improve call rates and phasing, will become more affordable.
A consensus haploid sequence is sufficient for many applications; however, it lacks two very important pieces of data for detecting disease causing variants in personal genomes: phased heterozygous variants and the identification of false positive and negative variant calls. By providing sequence data from both the maternal and paternal chromosomes independently, LFR is able to detect regions in the genome assembly in which only one allele has been covered. Likewise, false positive calls are avoided because LFR independently, in separate aliquots, sequences both the maternal and paternal chromosomes 10–40 times. The result is a statistically low probability that random sequencing or DNA amplification errors would repeatedly occur in several aliquots at the same base position on one parental allele. Thus, LFR allows for the first time, to our knowledge, both accurate and cost-effective sequencing of a genome from a few human cells in spite of the required extensive DNA amplification. Furthermore, by phasing SNPs over hundreds of kilobases (or over entire chromosomes by integrating LFR with routine genotyping of at least one parent), LFR is able to more accurately predict the effects of compound regulatory variants and parental imprinting on allele-specific gene expression and function in various tissue types. Additionally, separation of mate-pair reads by haplotype may also help to detect expanded trinucleotide repeats in diseases such as Huntington’s disease, even though LFR does not provide direct length measure of these or similar repeats. Taken together, this provides a highly accurate report about the potential genomic changes that could cause gain or loss of protein function. This kind of information, obtained inexpensively for every patient, will be crucial for clinical use of genomic data. Moreover, successful and affordable diploid sequencing of a human genome starting from ten cells opens the possibility for comprehensive and accurate genetic screening of micro-biopsies from diverse tissue sources such as circulating tumour cells or pre-implantation embryos generated through in vitro fertilization.
High molecular mass DNA was purified from cell lines GM12877, GM12878, GM12885, GM12886, GM12891, GM12892 GM19240 and GM20431 (Coriell Institute for Medical Research) using a RecoverEase DNA isolation kit (Agilent) following the manufacturer’s protocol. Individual cells of NA19240 were isolated under ×200 magnification with a micromanipulator (Eppendorf) and deposited into a 1.5-ml microtube with 10 μl of distilled H2O. LFR libraries were made as outlined in the text; a more detailed description can be found in the Supplementary Methods. LFR libraries were sequenced, mapped and assembled using the sequencing pipeline of Complete Genomics. Phasing was performed using custom haplotyping algorithms as described in Fig. 2 and in further detail in the Supplementary Methods. Variations adversely affecting protein function or expression were found using several methods. Missense variations were analysed using Polyphen2 (ref. 33). For this study both ‘possibly damaging’ and ‘probably damaging’ were considered to be detrimental to protein function, as were all nonsense mutations. Variations determined to adversely affect messenger RNA splicing were found with a custom algorithm based on consensus splice position models from Steve Mount’s database (http://www.life.umd.edu/labs/mount/RNAinfo). JASPAR models36,37 were used to extract potential TFBSs from the reference genome with mast (http://meme.sdsc.edu/meme/mast-intro.html). Variations falling with these regions were compared with the models to determine what affect they had on transcription factor binding. Genes found to have two or more detrimental mutations were further analysed only if all mutations were found within the same haplotype contig. More detailed descriptions of all methods used in this paper can be found in the Supplementary Methods.
We would like to acknowledge the continuing contributions and support of all Complete Genomics employees, in particular M. McElwain, D. Bailey, D. Kruse and J. Turcotte for their help with preparing the manuscript. We also wish to thank W. Chao for his help with Figures 1 and 2. Some of this work was supported by the US Department of Commerce, National Institute of Standards and Technology, Advanced Technology Program, Cooperative Agreement Number 70NANB7H7027 and National Institutes of Health grant P50HG005550. We would like to thank J. Chen for managing the NIST grant.
This file contains Supplementary Figures 1-12, Supplementary Material with additional references, Supplementary Methods with additional Figures 1-14 and Supplementary Tables 1-13.